4. Explore the Data

"I don't know, what I don't know"

We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.

For this we will start by loading the data and understanding the data structure of the dataframe we have.

Lets read the data


In [1]:
sessionInfo()


Out[1]:
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-apple-darwin15.3.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] magrittr_1.5    IRdisplay_0.3   tools_3.2.4     base64enc_0.1-3
 [5] uuid_0.1-2      stringi_1.0-1   rzmq_0.7.7      IRkernel_0.5   
 [9] jsonlite_0.9.19 stringr_1.0.0   digest_0.6.8    repr_0.4       
[13] evaluate_0.8   

In [3]:
install.packages('dplyr',repos='http://ftp.iitm.ac.in/cran')


Installing package into '/usr/local/lib/R/3.2/site-library'
(as 'lib' is unspecified)
The downloaded source packages are in
	'/private/var/folders/04/r20f0_4n2m7cv23lr8t97wp00000gn/T/RtmpRVFUJm/downloaded_packages'

In [ ]:
install.packages('ggplot2',repos='http://ftp.iitm.ac.in/cran')

In [2]:
# Import the library we need, which is dplyr and ggplot2
library(dplyr)
library(ggplot2)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


In [5]:
# Configuring jupyter plotting
options(repr.plot.width=10, repr.plot.height=6)

You will find the variable df used quite often to store a dataframe


In [6]:
# Read the csv file of Monthwise Quantity and Price csv file we have.
df <- read.csv('MonthWiseMarketArrivals_Clean.csv')

Understand Data Structure and Types


In [7]:
dim(df)


Out[7]:
  1. 10320
  2. 10

In [8]:
head(df)


Out[8]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1ABOHAR(PB)January20052350404493446ABOHARPB2005-01-01
2ABOHAR(PB)January2006900487638563ABOHARPB2006-01-01
3ABOHAR(PB)January2010790128315921460ABOHARPB2010-01-01
4ABOHAR(PB)January2011245306737503433ABOHARPB2011-01-01
5ABOHAR(PB)January20121035523686605ABOHARPB2012-01-01
6ABOHAR(PB)January2013675132719001605ABOHARPB2013-01-01

Data Structure

So we have ten columns in our dataset. Let us understand what each one is.

Three are about the location of the Wholesale Market where Onion where sold.

  • state: This is the 2/3 letter abbreviation for the state in India (PB is Punjab and so on)
  • city: This is the city in India (ABOHAR, BANGALORE and so on)
  • market: This is a string with the combination of the state and city

Three are related to the

  • month: Month in January, February and so on.
  • year: Year in YYYY representastion
  • date: The combination of the two above.

Four are about quantity and price in these wholesale market.

  • quantity: The quanity of Onion arriving in the market in that month in quintals (100 kg)
  • priceMin: The minimum price in the month in Rs./quintal
  • priceMax: The maximum price in the month in Rs./quintal
  • priceMod: The modal price in the month in Rs./quintal

We would expect the following the columns to be of the following type

  • CATEGORICAL: state, city, market
  • TIME INTERVAL: month, year, date
  • QUANTITATIVE: quantity, priceMin, priceMax, priceModal

Let us see what pandas dataframe is reading these columns as.


In [9]:
# Get the structure of the data frame
str(df)


'data.frame':	10320 obs. of  10 variables:
 $ market  : Factor w/ 122 levels "ABOHAR(PB)","AGRA(UP)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ month   : Factor w/ 12 levels "April","August",..: 5 5 5 5 5 5 5 5 4 4 ...
 $ year    : int  2005 2006 2010 2011 2012 2013 2014 2015 2005 2006 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: int  404 487 1283 3067 523 1327 1025 1309 286 343 ...
 $ priceMax: int  493 638 1592 3750 686 1900 1481 1858 365 411 ...
 $ priceMod: int  446 563 1460 3433 605 1605 1256 1613 324 380 ...
 $ city    : Factor w/ 119 levels "ABOHAR","AGRA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ state   : Factor w/ 22 levels "AP","ASM","BHR",..: 17 17 17 17 17 17 17 17 17 17 ...
 $ date    : Factor w/ 243 levels "1996-01-01","1996-02-01",..: 109 121 169 181 193 205 217 229 110 122 ...

So we are getting the quantitive columns are correctly being shown as integers and the categorical columns are showing as objects(strings) which is fine. However, the date columns is being read as an object and not a Time-Interval. Let us at least fix the date column and make it into a datetime object


In [10]:
as.Date("2016-04-01", "%Y-%m-%d")


Out[10]:
[1] "2016-04-01"

In [12]:
# Changing the date column to a Time Interval columnn
df$date <- as.Date(as.character(df$date), "%Y-%m-%d")

In [13]:
# Now checking for type of each column
str(df)


'data.frame':	10320 obs. of  10 variables:
 $ market  : Factor w/ 122 levels "ABOHAR(PB)","AGRA(UP)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ month   : Factor w/ 12 levels "April","August",..: 5 5 5 5 5 5 5 5 4 4 ...
 $ year    : int  2005 2006 2010 2011 2012 2013 2014 2015 2005 2006 ...
 $ quantity: int  2350 900 790 245 1035 675 440 1305 1400 1800 ...
 $ priceMin: int  404 487 1283 3067 523 1327 1025 1309 286 343 ...
 $ priceMax: int  493 638 1592 3750 686 1900 1481 1858 365 411 ...
 $ priceMod: int  446 563 1460 3433 605 1605 1256 1613 324 380 ...
 $ city    : Factor w/ 119 levels "ABOHAR","AGRA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ state   : Factor w/ 22 levels "AP","ASM","BHR",..: 17 17 17 17 17 17 17 17 17 17 ...
 $ date    : Date, format: "2005-01-01" "2006-01-01" ...

In [14]:
# Let us see the dataframe again now
head(df)


Out[14]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1ABOHAR(PB)January20052350404493446ABOHARPB2005-01-01
2ABOHAR(PB)January2006900487638563ABOHARPB2006-01-01
3ABOHAR(PB)January2010790128315921460ABOHARPB2010-01-01
4ABOHAR(PB)January2011245306737503433ABOHARPB2011-01-01
5ABOHAR(PB)January20121035523686605ABOHARPB2012-01-01
6ABOHAR(PB)January2013675132719001605ABOHARPB2013-01-01

Question 1 - How big is the Bangalore onion market compared to other cities in India?

Let us try to do this examination for one of the year only. So we want to reduce our dataframe for only where the year = 2010. This process is called subset.

PRINCIPLE: filter for rows and/or select columns in a dataframe

  • verb: filter for rows

  • verb: select for columns


In [15]:
df2010 <- filter(df, year == 2010)

It is easier to write chained function using the pipe function - %>%


In [16]:
df2010 <- df %>% 
          filter(year == 2010)

In [60]:
head(df2010)


Out[60]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1ABOHAR(PB)January2010790128315921460ABOHARPB2010-01-01
2ABOHAR(PB)February2010555114314601322ABOHARPB2010-02-01
3ABOHAR(PB)March2010385510878688ABOHARPB2010-03-01
4ABOHAR(PB)April2010840466755611ABOHARPB2010-04-01
5ABOHAR(PB)May20102050391578494ABOHARPB2010-05-01
6ABOHAR(PB)June20102075363515460ABOHARPB2010-06-01

In [17]:
# We can also filter on multiple criterias and select only particular columns
df2010Bang <- df %>% 
              filter((year == 2010) & (city == 'BANGALORE')) %>%
              select(market, year, quantity)

In [18]:
head(df2010Bang)


Out[18]:
marketyearquantity
1BANGALORE2010423649
2BANGALORE2010316685
3BANGALORE2010368644
4BANGALORE2010404805
5BANGALORE2010395519
6BANGALORE2010362618

Exercise: Filter for market Lasalgaon and select on market and price columns


In [19]:
unique(df$city)


Out[19]:
  1. ABOHAR
  2. AGRA
  3. AHMEDABAD
  4. AHMEDNAGAR
  5. AJMER
  6. ALIGARH
  7. ALWAR
  8. AMRITSAR
  9. BALLIA
  10. BANGALORE
  11. BAREILLY
  12. BELGAUM
  13. BHATINDA
  14. BHAVNAGAR
  15. BHOPAL
  16. BHUBNESWER
  17. BIHARSHARIF
  18. BIJAPUR
  19. BIKANER
  20. BOMBORI
  21. BURDWAN
  22. CHAKAN
  23. CHALLAKERE
  24. CHANDIGARH
  25. CHANDVAD
  26. CHENNAI
  27. CHICKBALLAPUR
  28. COIMBATORE
  29. DEESA
  30. DEHRADOON
  31. DELHI
  32. DEORIA
  33. DEVALA
  34. DEWAS
  35. DHAVANGERE
  36. DHULIA
  37. DINDIGUL
  38. DINDORI
  39. ETAWAH
  40. FARUKHABAD
  41. GONDAL
  42. GORAKHPUR
  43. GUWAHATI
  44. HALDWANI
  45. HASSAN
  46. HOSHIARPUR
  47. HUBLI
  48. HYDERABAD
  49. INDORE
  50. JAIPUR
  51. JALANDHAR
  52. JALGAON
  53. JAMMU
  54. JAMNAGAR
  55. JHANSI
  56. JODHPUR
  57. JUNNAR
  58. KALVAN
  59. KANPUR
  60. KARNAL
  61. KHANNA
  62. KOLAR
  63. KOLHAPUR
  64. KOLKATA
  65. KOPERGAON
  66. KOTA
  67. KURNOOL
  68. LASALGAON
  69. LONAND
  70. LUCKNOW
  71. LUDHIANA
  72. MADURAI
  73. MAHUVA
  74. MALEGAON
  75. MANDSOUR
  76. MANMAD
  77. MEERUT
  78. MIDNAPUR
  79. MUMBAI
  80. NAGPUR
  81. NANDGAON
  82. NASIK
  83. NEEMUCH
  84. NEWASA
  85. NIPHAD
  86. PALAYAM
  87. PATIALA
  88. PATNA
  89. PHALTAN
  90. PIMPALGAON
  91. PUNE
  92. PURULIA
  93. RAHATA
  94. RAHURI
  95. RAICHUR
  96. RAIPUR
  97. RAJAHMUNDRY
  98. RAJKOT
  99. RANCHI
  100. SAGAR
  101. SAIKHEDA
  102. SANGALI
  103. SANGAMNER
  104. SATANA
  105. SHEROAPHULY
  106. SHIMLA
  107. SHRIRAMPUR
  108. SINNAR
  109. SOLAPUR
  110. SRIGANGANAGAR
  111. SRINAGAR
  112. SRIRAMPUR
  113. SURAT
  114. TRIVENDRUM
  115. UDAIPUR
  116. UJJAIN
  117. VANI
  118. VARANASI
  119. YEOLA

In [23]:
df_lasalgaon <- df %>% filter(city == 'LASALGAON')

In [24]:
dim(df_lasalgaon)


Out[24]:
  1. 243
  2. 10

Principle: Split Apply Combine (use group_by and summarize)

How do we get the sum of quantity for each city.

We need to SPLIT the data by each city, APPLY the sum to the quantity row and then COMBINE the data again

In dplyr, we use the groupby function to do the grouping and summarize to the apply part.


In [25]:
# Group by using city
df2010City = df2010 %>%
             group_by(city) %>%
             summarize(quantity_year = sum(quantity))

In [26]:
head(df2010City)


Out[26]:
cityquantity_year
1ABOHAR11835
2AGRA756755
3AHMEDABAD1135418
4AHMEDNAGAR1678032
5ALWAR561145
6AMRITSAR114417

In [27]:
# Let us see this dataframe
head(df2010City)


Out[27]:
cityquantity_year
1ABOHAR11835
2AGRA756755
3AHMEDABAD1135418
4AHMEDNAGAR1678032
5ALWAR561145
6AMRITSAR114417

Exercise: Find the sum of quantity for 2015 for each state


In [29]:
head(df)


Out[29]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1ABOHAR(PB)January20052350404493446ABOHARPB2005-01-01
2ABOHAR(PB)January2006900487638563ABOHARPB2006-01-01
3ABOHAR(PB)January2010790128315921460ABOHARPB2010-01-01
4ABOHAR(PB)January2011245306737503433ABOHARPB2011-01-01
5ABOHAR(PB)January20121035523686605ABOHARPB2012-01-01
6ABOHAR(PB)January2013675132719001605ABOHARPB2013-01-01

In [30]:
sum_quantity_df <- df %>% 
filter(year==2015) %>% 
group_by(state) %>% 
summarize(sum_quantity = sum(quantity),avg_price = mean(priceMax))

In [31]:
head(sum_quantity_df)


Out[31]:
statesum_quantityavg_price
1AP23246182458
2ASM360132986.917
3BHR4768002603.643
4DEL32721392543.083
5GUJ97524602151.518
6HR1460282464.435

PRINCIPLE: Arrange the rows

For sorting the variables we need to use the verb - arrange


In [34]:
# Sort the Dataframe by Quantity to see which one is on top
df2010City <- df2010City %>%
              arrange(desc(quantity_year))

In [35]:
head(df2010City)


Out[35]:
cityquantity_year
1BANGALORE6079067
2DELHI3508582
3KOLKATA3495320
4PUNE3326024
5SOLAPUR3310419
6MUMBAI2921005

In [36]:
df2010CitySmall <- df2010City %>% 
                   filter(quantity_year > 500000) %>%
                   arrange(desc(quantity_year))

In [37]:
head(df2010CitySmall)


Out[37]:
cityquantity_year
1BANGALORE6079067
2DELHI3508582
3KOLKATA3495320
4PUNE3326024
5SOLAPUR3310419
6MUMBAI2921005

Exercise: Sort the sum of quantity for 2015 for each state in descending order


In [ ]:


In [ ]:


In [ ]:
## PRINCIPLE: Visual Exploration 
We will be using ggplot2 for doing visual exploration in R

Packages to be installed for Mac users

install.packages('tidyr',repos='http://ftp.iitm.ac.in/cran') install.packages('lubridate',repos='http://ftp.iitm.ac.in/cran') install.packages('stringr',repos='http://ftp.iitm.ac.in/cran') install.packages('rvest',repos='http://ftp.iitm.ac.in/cran')


In [39]:
library(tidyr)
library(lubridate)
library(stringr)
library(rvest)


Loading required package: xml2

PRINCIPLE: Visual Exploration

We will be using ggplot2 for doing visual exploration in R


In [41]:
# Plot the Data
ggplot(df2010CitySmall) + 
aes(city,weight = quantity_year) + 
geom_bar() + 
coord_flip()



In [48]:
head(reorder(df2010CitySmall$city,df2010CitySmall$quantity_year),10)


Out[48]:
  1. BANGALORE
  2. DELHI
  3. KOLKATA
  4. PUNE
  5. SOLAPUR
  6. MUMBAI
  7. PIMPALGAON
  8. MAHUVA
  9. LASALGAON
  10. MALEGAON

In [43]:
# Plot the Data
ggplot(df2010CitySmall) + 
aes(reorder(city, quantity_year), weight = quantity_year) + 
geom_bar() +
coord_flip()


Exercise: Show the State with Quantity Sales in 2015?


In [51]:
df2015 <- df %>% filter(year==2015)

In [52]:
ggplot(df2015) +
aes(state,weight=quantity)+
geom_bar() +
coord_flip()


Exercise: Show the State with Highest Price in 2015?


In [55]:
head(df2015)


Out[55]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1ABOHAR(PB)January20151305130918581613ABOHARPB2015-01-01
2ABOHAR(PB)February20151115120019461688ABOHARPB2015-02-01
3ABOHAR(PB)March2015920126019801745ABOHARPB2015-03-01
4ABOHAR(PB)May2015940102016201310ABOHARPB2015-05-01
5ABOHAR(PB)June201561095718291457ABOHARPB2015-06-01
6ABOHAR(PB)July2015795100820001517ABOHARPB2015-07-01

In [58]:
df2015_priceMax <- df %>% group_by(state) %>% summarise(maxPrice=max(priceMax))

In [59]:
head(df2015_priceMax)


Out[59]:
statemaxPrice
1AP5305
2ASM5733
3BHR5504
4DEL5181
5GUJ5286
6HP2200

In [61]:
ggplot(df2015_priceMax)+
aes(reorder(state,maxPrice),weight=maxPrice)+
geom_bar() +
coord_flip()


Question 2 - Has the price variation in Onion prices in Bangalore really gone up over the years?


In [62]:
head(df)


Out[62]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1ABOHAR(PB)January20052350404493446ABOHARPB2005-01-01
2ABOHAR(PB)January2006900487638563ABOHARPB2006-01-01
3ABOHAR(PB)January2010790128315921460ABOHARPB2010-01-01
4ABOHAR(PB)January2011245306737503433ABOHARPB2011-01-01
5ABOHAR(PB)January20121035523686605ABOHARPB2012-01-01
6ABOHAR(PB)January2013675132719001605ABOHARPB2013-01-01

In [63]:
dfBang <- df %>% filter(city == 'BANGALORE')

In [64]:
head(dfBang)


Out[64]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1BANGALOREJanuary20042278329161066991BANGALOREKNT2004-01-01
2BANGALOREJanuary2005335679470597522BANGALOREKNT2005-01-01
3BANGALOREJanuary2006412185286617537BANGALOREKNT2006-01-01
4BANGALOREJanuary20072682685861167942BANGALOREKNT2007-01-01
5BANGALOREJanuary2008393806174671472BANGALOREKNT2008-01-01
6BANGALOREJanuary200937438084815541328BANGALOREKNT2009-01-01

In [65]:
summary(dfBang)


Out[65]:
            market         month         year         quantity      
 BANGALORE     :147   February:13   Min.   :2004   Min.   :  63824  
 ABOHAR(PB)    :  0   January :13   1st Qu.:2007   1st Qu.: 329750  
 AGRA(UP)      :  0   March   :13   Median :2010   Median : 405716  
 AHMEDABAD(GUJ):  0   April   :12   Mean   :2010   Mean   : 523630  
 AHMEDNAGAR(MS):  0   August  :12   3rd Qu.:2013   3rd Qu.: 660674  
 AJMER(RAJ)    :  0   December:12   Max.   :2016   Max.   :1639032  
 (Other)       :  0   (Other) :72                                   
    priceMin         priceMax       priceMod            city         state    
 Min.   : 145.0   Min.   : 338   Min.   : 320   BANGALORE :147   KNT    :147  
 1st Qu.: 306.5   1st Qu.: 687   1st Qu.: 551   ABOHAR    :  0   AP     :  0  
 Median : 441.0   Median :1021   Median : 828   AGRA      :  0   ASM    :  0  
 Mean   : 555.2   Mean   :1312   Mean   :1041   AHMEDABAD :  0   BHR    :  0  
 3rd Qu.: 651.0   3rd Qu.:1612   3rd Qu.:1323   AHMEDNAGAR:  0   DEL    :  0  
 Max.   :2377.0   Max.   :4698   Max.   :3430   AJMER     :  0   GUJ    :  0  
                                                (Other)   :  0   (Other):  0  
      date           
 Min.   :2004-01-01  
 1st Qu.:2007-01-16  
 Median :2010-02-01  
 Mean   :2010-01-30  
 3rd Qu.:2013-02-15  
 Max.   :2016-03-01  
                     

In [66]:
# Set the index as date
dfBang <- dfBang %>% 
          arrange(date)
head(dfBang)


Out[66]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1BANGALOREJanuary20042278329161066991BANGALOREKNT2004-01-01
2BANGALOREFebruary2004225133741870793BANGALOREKNT2004-02-01
3BANGALOREMarch2004221952527586556BANGALOREKNT2004-03-01
4BANGALOREApril2004185150419518465BANGALOREKNT2004-04-01
5BANGALOREMay2004137390400516455BANGALOREKNT2004-05-01
6BANGALOREJune2004311445486621551BANGALOREKNT2004-06-01

In [67]:
ggplot(dfBang) + aes(date, priceMod) + geom_line()


PRINCIPLE: Convert from Wide format to Tall format using gather

Many times during exploration, we will need to convert the data frame from wide format to tall format (and vice versa).


In [68]:
head(dfBang)


Out[68]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedate
1BANGALOREJanuary20042278329161066991BANGALOREKNT2004-01-01
2BANGALOREFebruary2004225133741870793BANGALOREKNT2004-02-01
3BANGALOREMarch2004221952527586556BANGALOREKNT2004-03-01
4BANGALOREApril2004185150419518465BANGALOREKNT2004-04-01
5BANGALOREMay2004137390400516455BANGALOREKNT2004-05-01
6BANGALOREJune2004311445486621551BANGALOREKNT2004-06-01

In [69]:
library(tidyr)

In [71]:
dim(dfBang)


Out[71]:
  1. 147
  2. 10

In [85]:
dfBangTall <- dfBang %>%
              gather("priceType", "priceValue",5:7) %>%
              arrange(date)

In [86]:
dim(dfBangTall)


Out[86]:
  1. 441
  2. 9

In [87]:
head(dfBangTall)


Out[87]:
marketmonthyearquantitycitystatedatepriceTypepriceValue
1BANGALOREJanuary2004227832BANGALOREKNT2004-01-01priceMin916
2BANGALOREJanuary2004227832BANGALOREKNT2004-01-01priceMax1066
3BANGALOREJanuary2004227832BANGALOREKNT2004-01-01priceMod991
4BANGALOREFebruary2004225133BANGALOREKNT2004-02-01priceMin741
5BANGALOREFebruary2004225133BANGALOREKNT2004-02-01priceMax870
6BANGALOREFebruary2004225133BANGALOREKNT2004-02-01priceMod793

In [88]:
ggplot(dfBangTall) + aes(date, y = priceValue, color = priceType) + geom_line()


PRINCIPLE: Create new variables using mutate

To calculate the range of change, we will create a new price difference variable - which is the difference between the priceMin and priceMax


In [89]:
dfBang <- dfBang %>% 
          mutate(priceDiff = priceMax - priceMin)

In [90]:
head(dfBang)


Out[90]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedatepriceDiff
1BANGALOREJanuary20042278329161066991BANGALOREKNT2004-01-01150
2BANGALOREFebruary2004225133741870793BANGALOREKNT2004-02-01129
3BANGALOREMarch2004221952527586556BANGALOREKNT2004-03-0159
4BANGALOREApril2004185150419518465BANGALOREKNT2004-04-0199
5BANGALOREMay2004137390400516455BANGALOREKNT2004-05-01116
6BANGALOREJune2004311445486621551BANGALOREKNT2004-06-01135

In [95]:
ggplot(dfBang) + aes(date, priceDiff) + geom_line() + geom_point(aes(date,priceMod))


PRINCIPLE: Pivot Table

Pivot table is a way to summarize data frame data into rows, columns and value


In [96]:
head(dfBang)


Out[96]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedatepriceDiff
1BANGALOREJanuary20042278329161066991BANGALOREKNT2004-01-01150
2BANGALOREFebruary2004225133741870793BANGALOREKNT2004-02-01129
3BANGALOREMarch2004221952527586556BANGALOREKNT2004-03-0159
4BANGALOREApril2004185150419518465BANGALOREKNT2004-04-0199
5BANGALOREMay2004137390400516455BANGALOREKNT2004-05-01116
6BANGALOREJune2004311445486621551BANGALOREKNT2004-06-01135

In [97]:
library(lubridate)

In [98]:
# Create new variable for Integer Month
dfBang <- dfBang %>%
          mutate(monthVal = month(date))

In [99]:
head(dfBang)


Out[99]:
marketmonthyearquantitypriceMinpriceMaxpriceModcitystatedatepriceDiffmonthVal
1BANGALOREJanuary20042278329161066991BANGALOREKNT2004-01-011501
2BANGALOREFebruary2004225133741870793BANGALOREKNT2004-02-011292
3BANGALOREMarch2004221952527586556BANGALOREKNT2004-03-01593
4BANGALOREApril2004185150419518465BANGALOREKNT2004-04-01994
5BANGALOREMay2004137390400516455BANGALOREKNT2004-05-011165
6BANGALOREJune2004311445486621551BANGALOREKNT2004-06-011356

In [100]:
dfBangGroup <- dfBang %>%
               group_by(year, monthVal) %>% 
               summarize(priceDiff)

In [101]:
head(dfBangGroup)


Out[101]:
yearmonthValpriceDiff
120041150
220042129
32004359
42004499
520045116
620046135

In [91]:
str(dfBangGroup)


Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	147 obs. of  3 variables:
 $ year     : int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
 $ monthVal : num  1 2 3 4 5 6 7 8 9 10 ...
 $ priceDiff: int  150 129 59 99 116 135 167 145 98 111 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol year
 - attr(*, "drop")= logi TRUE

PRINCIPLE: Convert from Tall format to Wide format using spread

Many times during exploration, we will need to convert the data frame from tall format to wide format.


In [102]:
dfBangPivot <- dfBangGroup %>%
               spread(monthVal, priceDiff)

In [103]:
dfBangPivot <- dfBang %>%
               group_by(year, monthVal) %>% 
               summarize(priceDiff) %>%
               spread(monthVal, priceDiff)

In [104]:
head(dfBangPivot)


Out[104]:
year123456789101112
12004150129599911613516714598111120177
220051271107975561071761692196021131521
32006331209147169142157189192290319337189
42007581611398181196159248381382914824643
52008497373334274331350348317537588604604
62009706663387341288466374379495125916161914

In [108]:
ggplot(dfBang) + aes(monthVal, weight = priceDiff) + geom_bar() + facet_wrap(~year)



In [ ]:

Exercise: Find the price variation for LASALGAON city?


In [ ]:


In [ ]: