"Data is messy"
We will be performing the following operation on our Onion price to refine it
Other stuff you may need to do to refine are...
In [1]:
library(plyr)
library(dplyr)
library(tidyr)
In [2]:
df <- read.csv('MonthWiseMarketArrivalsAll.csv', stringsAsFactor = FALSE)
In [3]:
str(df)
In [4]:
dim(df)
Out[4]:
In [5]:
column_names <- c('market', 'month', 'year', 'quantity', 'priceMin', 'priceMax', 'priceMod')
In [6]:
colnames(df) <- column_names
In [7]:
head(df)
Out[7]:
In [8]:
class(df)
Out[8]:
In [9]:
str(df)
In [10]:
tail(df,1)
Out[10]:
In [11]:
df <- df %>%
filter( year != "Total")
In [12]:
tail(df)
Out[12]:
In [13]:
dim(df)
Out[13]:
In [14]:
# selecting particular columns from the df
head(df %>%
select(priceMin,priceMax,priceMod))
Out[14]:
In [15]:
str(df)
In [16]:
#we see that all of them are character, we know that they are integers, so lets convert
df$quantity <- as.numeric(df$quantity)
df$year <- as.numeric(df$year)
df$priceMin <- as.numeric(df$priceMin)
df$priceMax <- as.numeric(df$priceMax)
df$priceMod <- as.numeric(df$priceMod)
In [17]:
str(df)
In [18]:
head(df)
Out[18]:
In [19]:
summary(df)
Out[19]:
In [20]:
library(stringr)
In [21]:
# we see that city and state are combined, lets split them up
df$market[1]
strsplit(as.character(df$market[1]),'\\(')[[1]][2]
Out[21]:
Out[21]:
In [22]:
head(df)
Out[22]:
In [25]:
head(df)
Out[25]:
In [24]:
tail(df)
Out[24]:
In [26]:
df <- df %>%
mutate(market1 = market) %>%
separate(market1, c("city", "state"), sep = "\\(")
In [27]:
head(df)
Out[27]:
In [28]:
df$state <- df$state %>% str_replace("\\)","")
In [29]:
head(df)
Out[29]:
In [30]:
head(df %>% filter(city == 'BANGALORE'))
Out[30]:
In [31]:
df <- df %>%
mutate(state = ifelse(is.na(state), market, state))
In [32]:
dim(df)
Out[32]:
In [33]:
tail(df)
Out[33]:
In [34]:
str(df)
In [35]:
unique(df$state)
Out[35]:
In [36]:
changeList <- c('BANGALORE' = 'KNT', 'BHOPAL' = 'MP', 'CHANDIGARH' = 'HR', 'CHENNAI' = 'TN', 'UTT' = 'UP',
'GUWAHATI' = 'ASM', 'HYDERABAD' = 'AP', 'DELHI' = 'DEL', 'JAIPUR' = 'RAJ', 'WHITE' = 'MS', 'JAMMU' = 'JK',
'KOLKATA' = 'WB', 'LUCKNOW' = 'UP', 'MUMBAI' = 'MH','NAGPUR' = 'MH', 'PATNA' = 'BHR',
'CHGARH' = 'HR', 'SHIMLA' = 'HP', 'SRINAGAR' = 'JK', 'TRIVENDRUM'= 'KEL')
In [37]:
df$state <- revalue(df$state, changeList)
In [38]:
unique(df$state)
Out[38]:
In [39]:
df <- df %>%
mutate(date = paste(month, year, sep="-"))
In [40]:
head(df)
Out[40]:
In [41]:
df$date = as.Date(paste("01-",df$date,sep=""), "%d-%B-%Y")
In [42]:
head(df)
Out[42]:
In [43]:
str(df)
In [44]:
write.csv(df, 'MonthWiseMarketArrivals_Clean.csv', row.names = FALSE)
SQL helps you strive towards data independence. Data independence is defined as:
... the independence of application programs and terminal activities from the growth in data types and changes in data representations."
Let us just use sqlite for today, as it is the most simple to setup
In [45]:
install.packages('RSQLite',repos='http://ftp.iitm.ac.in/cran/')
In [46]:
library(RSQLite)
In [47]:
my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
Now let us load the csv into the database
In [48]:
df <- read.csv('MonthWiseMarketArrivalsAll.csv', stringsAsFactor = FALSE)
In [49]:
column_names <- c('market', 'month', 'year', 'quantity', 'priceMin', 'priceMax', 'priceMod')
In [50]:
colnames(df) <- column_names
In [51]:
onion_sqlite <- copy_to(dest=my_db,df=df,name='onion',temporary = FALSE)
In [52]:
tbl(my_db,sql("SELECT * from onion limit 3"))
Out[52]:
In [53]:
#let us select entries from the table
select(onion_sqlite,market,month)
Out[53]:
In [54]:
# now let us select entries which belong to the Market ABOHAR(PB)
filter(onion_sqlite, market =='ABOHAR(PB)')
Out[54]:
In [55]:
# now let us sort on year for ABOHAR(FB)
filter(onion_sqlite, market =='ABOHAR(PB)') %>% arrange(year)
Out[55]:
In [56]:
#lets now create a new column where generate priceMax - priceMin
filter(onion_sqlite, market =='ABOHAR(PB)') %>% mutate(diff = priceMax - priceMin)
Out[56]:
In [57]:
# now let us generate avg of diff
filter(onion_sqlite, market =='ABOHAR(PB)') %>%
mutate(diff = priceMax - priceMin) %>%
summarise(meanDiff = mean(diff))
Out[57]:
When working with databases, dplyr tries to be as lazy as possible. It’s lazy in two ways:
It never pulls data back to R unless you explicitly ask for it.
It delays doing any work until the last possible minute, collecting together everything you want to do then sending that to the database in one step.
In [58]:
m1 <- filter(onion_sqlite, market =='ABOHAR(PB)')
In [59]:
m2 <- mutate(m1,diff = priceMax - priceMin)
In [60]:
m3 <- summarise(m2,meanDiff = mean(diff))
In [61]:
m3
Out[61]:
In [62]:
explain(m3)
In [63]:
tbl(my_db,sql("SELECT * from onion limit 3"))
Out[63]:
In [ ]: