"I don't know, what I don't know"
We want to first visually explore the data to see if we can confirm some of our initial hypotheses as well as make new hypothesis about the problem we are trying to solve.
For this we will start by loading the data and understanding the data structure of the dataframe we have.
In [1]:
sessionInfo()
Out[1]:
In [3]:
install.packages('dplyr',repos='http://ftp.iitm.ac.in/cran')
In [ ]:
install.packages('ggplot2',repos='http://ftp.iitm.ac.in/cran')
In [2]:
# Import the library we need, which is dplyr and ggplot2
library(dplyr)
library(ggplot2)
In [5]:
# Configuring jupyter plotting
options(repr.plot.width=10, repr.plot.height=6)
You will find the variable df
used quite often to store a dataframe
In [6]:
# Read the csv file of Monthwise Quantity and Price csv file we have.
df <- read.csv('MonthWiseMarketArrivals_Clean.csv')
In [7]:
dim(df)
Out[7]:
In [8]:
head(df)
Out[8]:
So we have ten columns in our dataset. Let us understand what each one is.
Three are about the location of the Wholesale Market where Onion where sold.
Three are related to the
Four are about quantity and price in these wholesale market.
We would expect the following the columns to be of the following type
Let us see what pandas dataframe is reading these columns as.
In [9]:
# Get the structure of the data frame
str(df)
So we are getting the quantitive columns are correctly being shown as integers and the categorical columns are showing as objects(strings) which is fine.
However, the date
columns is being read as an object and not a Time-Interval. Let us at least fix the date
column and make it into a datetime object
In [10]:
as.Date("2016-04-01", "%Y-%m-%d")
Out[10]:
In [12]:
# Changing the date column to a Time Interval columnn
df$date <- as.Date(as.character(df$date), "%Y-%m-%d")
In [13]:
# Now checking for type of each column
str(df)
In [14]:
# Let us see the dataframe again now
head(df)
Out[14]:
In [15]:
df2010 <- filter(df, year == 2010)
It is easier to write chained function using the pipe function - %>%
In [16]:
df2010 <- df %>%
filter(year == 2010)
In [60]:
head(df2010)
Out[60]:
In [17]:
# We can also filter on multiple criterias and select only particular columns
df2010Bang <- df %>%
filter((year == 2010) & (city == 'BANGALORE')) %>%
select(market, year, quantity)
In [18]:
head(df2010Bang)
Out[18]:
In [19]:
unique(df$city)
Out[19]:
In [23]:
df_lasalgaon <- df %>% filter(city == 'LASALGAON')
In [24]:
dim(df_lasalgaon)
Out[24]:
group_by
and summarize
)How do we get the sum of quantity for each city.
We need to SPLIT the data by each city, APPLY the sum to the quantity row and then COMBINE the data again
In dplyr, we use the groupby
function to do the grouping and summarize
to the apply part.
In [25]:
# Group by using city
df2010City = df2010 %>%
group_by(city) %>%
summarize(quantity_year = sum(quantity))
In [26]:
head(df2010City)
Out[26]:
In [27]:
# Let us see this dataframe
head(df2010City)
Out[27]:
In [29]:
head(df)
Out[29]:
In [30]:
sum_quantity_df <- df %>%
filter(year==2015) %>%
group_by(state) %>%
summarize(sum_quantity = sum(quantity),avg_price = mean(priceMax))
In [31]:
head(sum_quantity_df)
Out[31]:
In [34]:
# Sort the Dataframe by Quantity to see which one is on top
df2010City <- df2010City %>%
arrange(desc(quantity_year))
In [35]:
head(df2010City)
Out[35]:
In [36]:
df2010CitySmall <- df2010City %>%
filter(quantity_year > 500000) %>%
arrange(desc(quantity_year))
In [37]:
head(df2010CitySmall)
Out[37]:
In [ ]:
In [ ]:
In [ ]:
## PRINCIPLE: Visual Exploration
We will be using ggplot2 for doing visual exploration in R
install.packages('tidyr',repos='http://ftp.iitm.ac.in/cran') install.packages('lubridate',repos='http://ftp.iitm.ac.in/cran') install.packages('stringr',repos='http://ftp.iitm.ac.in/cran') install.packages('rvest',repos='http://ftp.iitm.ac.in/cran')
In [39]:
library(tidyr)
library(lubridate)
library(stringr)
library(rvest)
In [41]:
# Plot the Data
ggplot(df2010CitySmall) +
aes(city,weight = quantity_year) +
geom_bar() +
coord_flip()
In [48]:
head(reorder(df2010CitySmall$city,df2010CitySmall$quantity_year),10)
Out[48]:
In [43]:
# Plot the Data
ggplot(df2010CitySmall) +
aes(reorder(city, quantity_year), weight = quantity_year) +
geom_bar() +
coord_flip()
In [51]:
df2015 <- df %>% filter(year==2015)
In [52]:
ggplot(df2015) +
aes(state,weight=quantity)+
geom_bar() +
coord_flip()
In [55]:
head(df2015)
Out[55]:
In [58]:
df2015_priceMax <- df %>% group_by(state) %>% summarise(maxPrice=max(priceMax))
In [59]:
head(df2015_priceMax)
Out[59]:
In [61]:
ggplot(df2015_priceMax)+
aes(reorder(state,maxPrice),weight=maxPrice)+
geom_bar() +
coord_flip()
In [62]:
head(df)
Out[62]:
In [63]:
dfBang <- df %>% filter(city == 'BANGALORE')
In [64]:
head(dfBang)
Out[64]:
In [65]:
summary(dfBang)
Out[65]:
In [66]:
# Set the index as date
dfBang <- dfBang %>%
arrange(date)
head(dfBang)
Out[66]:
In [67]:
ggplot(dfBang) + aes(date, priceMod) + geom_line()
In [68]:
head(dfBang)
Out[68]:
In [69]:
library(tidyr)
In [71]:
dim(dfBang)
Out[71]:
In [85]:
dfBangTall <- dfBang %>%
gather("priceType", "priceValue",5:7) %>%
arrange(date)
In [86]:
dim(dfBangTall)
Out[86]:
In [87]:
head(dfBangTall)
Out[87]:
In [88]:
ggplot(dfBangTall) + aes(date, y = priceValue, color = priceType) + geom_line()
In [89]:
dfBang <- dfBang %>%
mutate(priceDiff = priceMax - priceMin)
In [90]:
head(dfBang)
Out[90]:
In [95]:
ggplot(dfBang) + aes(date, priceDiff) + geom_line() + geom_point(aes(date,priceMod))
In [96]:
head(dfBang)
Out[96]:
In [97]:
library(lubridate)
In [98]:
# Create new variable for Integer Month
dfBang <- dfBang %>%
mutate(monthVal = month(date))
In [99]:
head(dfBang)
Out[99]:
In [100]:
dfBangGroup <- dfBang %>%
group_by(year, monthVal) %>%
summarize(priceDiff)
In [101]:
head(dfBangGroup)
Out[101]:
In [91]:
str(dfBangGroup)
In [102]:
dfBangPivot <- dfBangGroup %>%
spread(monthVal, priceDiff)
In [103]:
dfBangPivot <- dfBang %>%
group_by(year, monthVal) %>%
summarize(priceDiff) %>%
spread(monthVal, priceDiff)
In [104]:
head(dfBangPivot)
Out[104]:
In [108]:
ggplot(dfBang) + aes(monthVal, weight = priceDiff) + geom_bar() + facet_wrap(~year)
In [ ]:
In [ ]:
In [ ]: