In [1]:
library("ggplot2")
library("sqldf")
In [2]:
data=read.csv("BANK_LOC_ALL_AG.csv",sep=";",encoding='UTF-8')
Let's take a look of this table:
In [3]:
head(data,3)
Let's take a look of the names of the cities covered in this table:
In [4]:
unique( data$地區 )
Wierdly, that the Taipei city is not included.
In [5]:
data$年月
From the above, we see that the data is gathered within the period of 04/2016 - 09/2016.
Now, let's see the column names of our data:
In [6]:
colnames(data)
cnames=colnames(data)
length(cnames)
The above header looks horrible. There are 198 columns! It looks like a joined table. Anyway, all the information we need are within this one. So, Let's try to retrieve the information we want. (Although it is possible that this table can be split. But I am lazy to evaluate whether to do this step at this moment)
In [89]:
dat=read.csv("BANK_LOC_ALL_AG_header_removed.csv",sep=";",header=FALSE,encoding='UTF-8')
In [90]:
head(dat,5)
Let's do a SQL query: age under 30, different cities, # times paid by the credit card,amount of money(NTD) (in 6 months:2016-04 to 2016-09)
In [91]:
cnames[3:44]
In [92]:
datAgeUnder30=sqldf("select V1 as 'date',V2 as 'city',V3+V5+V7+V9+V11+V13+V15+V17+V19+V21+V23+V25+V27+V29+V31+V33+V35+V37+V39+V41+V43 as 'NumPays', V4+V6+V8+V10+V12+V14+V16+V18+V20+V22+V24+V26+V28+V30+V32+V34+V36+V38+V40+V42+V44 as 'amount' from dat")
In [93]:
head(datAgeUnder30,12)
The above table is independent of category of consumption and is generated from people with age <30.
I have wierd Chinese issues on my Mac. I tried utf-8 encoding but still in vain.
In order to solve this temporarily, let's translate the above table into English. (hard work & stupid)
In [94]:
datAgeUnder30$date[ datAgeUnder30$date=="105年04月" ] = "10504"
datAgeUnder30$date[ datAgeUnder30$date=="105年05月" ] = "10505"
datAgeUnder30$date[ datAgeUnder30$date=="105年06月" ] = "10506"
datAgeUnder30$date[ datAgeUnder30$date=="105年07月" ] = "10507"
datAgeUnder30$date[ datAgeUnder30$date=="105年08月" ] = "10508"
datAgeUnder30$date[ datAgeUnder30$date=="105年09月" ] = "10509"
In [96]:
unique(datAgeUnder30$city)
In [97]:
datAgeUnder30$city[ datAgeUnder30$city=="基隆市" ] = "KeelungCty"
datAgeUnder30$city[ datAgeUnder30$city=="新竹市" ] = "HsingzuCty"
datAgeUnder30$city[ datAgeUnder30$city=="新竹縣" ] = "HsingzuCny"
datAgeUnder30$city[ datAgeUnder30$city=="苗栗縣" ] = "MiaoliCny"
datAgeUnder30$city[ datAgeUnder30$city=="彰化縣" ] = "ChanghwaCny"
datAgeUnder30$city[ datAgeUnder30$city=="南投縣" ] = "NantouCny"
datAgeUnder30$city[ datAgeUnder30$city=="雲林縣" ] = "YunlinCny"
datAgeUnder30$city[ datAgeUnder30$city=="嘉義市" ] = "ChiayiCty"
datAgeUnder30$city[ datAgeUnder30$city=="嘉義縣" ] = "ChiayiCny"
datAgeUnder30$city[ datAgeUnder30$city=="屏東縣" ] = "PingTongCny"
datAgeUnder30$city[ datAgeUnder30$city=="宜蘭縣" ] = "YeelanCny"
datAgeUnder30$city[ datAgeUnder30$city=="花蓮縣" ] = "HwalianCny"
datAgeUnder30$city[ datAgeUnder30$city=="台東縣" ] = "TaitungCny"
datAgeUnder30$city[ datAgeUnder30$city=="澎湖縣" ] = "PenghuCny"
datAgeUnder30$city[ datAgeUnder30$city=="金門縣" ] = "KingmenCny"
datAgeUnder30$city[ datAgeUnder30$city=="連江縣" ] = "LianchiangCny"
In [99]:
ggplot ( datAgeUnder30 , aes ( x = amount , y = NumPays,colour=date )) + geom_point ()
Number of times paid by the credit card is in general proportional to the amount of money paid by the credit card, in a linear way.
In [115]:
ggplot ( datAgeUnder30 , aes ( x = date , y = NumPays , fill = date )) + geom_boxplot ()
In [116]:
ggplot ( datAgeUnder30 , aes ( x = date , y = amount , fill = date )) + geom_boxplot ()
The total amount money paid by credit card is slighly skewed on July.
Do wealthy people buy more on July?
Now, let us sum over 6 months and get the table:
(city, total amount of times paid by the credit card, total amount of money paid by the credit card)
in ascending order.
In [125]:
datAgeUnder30GroupedbyCity=sqldf("select city, SUM(NumPays) as 'NumPays', SUM(amount) as 'amount' from datAgeUnder30 group by city order by NumPays")
In [127]:
dim(datAgeUnder30GroupedbyCity)
In [128]:
datAgeUnder30GroupedbyCity$index=c(1:16)
In [129]:
datAgeUnder30GroupedbyCity
In [130]:
ggplot ( datAgeUnder30GroupedbyCity , aes ( x = index , y = NumPays ) ) + geom_bar(stat="identity",position="identity",width=0.5,fill="blue")
In [131]:
print(datAgeUnder30GroupedbyCity$city)
In [132]:
ggplot ( datAgeUnder30GroupedbyCity , aes ( x = index , y = amount ) ) + geom_bar(stat="identity",position="identity",width=0.5,fill="blue")
In [133]:
print(datAgeUnder30GroupedbyCity$city)