read.csv

The utils package is loaded automatically when an R session starts, so you can import CSV files with read.csv() without loading anything else.

Use read.csv() to import the file "swimming_pools.csv" as a data frame named pools.
Print the structure of the data frame with str().



In [1]:

    
pools <- read.csv('swimming_pools.csv')



In [2]:

    
# Check the structure of pools
str(pools)

'data.frame':	20 obs. of  4 variables:
 $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
 $ Address  : Factor w/ 20 levels "100 Edmonstone Street, South Brisbane",..: 4 20 18 10 8 11 5 15 12 17 ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...

stringsAsFactors

stringsAsFactors tells R whether to convert string columns to factors. In R versions before 4.0.0, the utils import functions default to stringsAsFactors = TRUE, which is why Name and Address appear as factors in the output above; since R 4.0.0 the default is FALSE.

If you set stringsAsFactors = FALSE, string columns are imported as-is, with type character. Numeric columns are unaffected.



In [3]:

    
# Import swimming_pools.csv correctly: pools
pools <- read.csv('swimming_pools.csv', stringsAsFactors=FALSE)



In [4]:

    
# Check the structure of pools
str(pools)

'data.frame':	20 obs. of  4 variables:
 $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
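
The same effect can be reproduced without the CSV file. The following is a minimal sketch that passes inline data through the text argument of read.csv(); the two-row sample and the variable names are made up for illustration and are not taken from swimming_pools.csv.

```r
# Hypothetical inline sample standing in for a CSV file
csv_text <- "Name,Latitude\nPool A,-27.6\nPool B,-27.5"

# Strings converted to factors (the default before R 4.0.0)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Strings kept as character vectors (the default since R 4.0.0)
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$Name)    # "factor"
class(as_char$Name)      # "character"
class(as_char$Latitude)  # "numeric" in both cases
```

Setting the argument explicitly, as in the cells above, keeps a script's behavior the same across R versions.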

read.delim & read.table

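read.delim() reads delimited text into a data frame, splitting columns on the separator sep (a tab by default).
read.table() reads any data in tabular form.

The two functions are closely related: read.delim() is a wrapper around read.table() with different defaults. read.table() defaults to header = FALSE and sep = "" (any whitespace), while read.delim() defaults to header = TRUE and sep = "\t". The header argument controls whether the first line is used as column names.

Note: head() displays the first n rows of a data frame (6 by default).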
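
As a rough illustration of how the defaults of the two functions relate, the sketch below reads the same inline tab-separated text with both. The sample rows and variable names are assumptions for the example, not the contents of hotdogs.txt.

```r
# Hypothetical two-row, tab-separated sample without a header line
tsv_text <- "Beef\t186\t495\nMeat\t173\t458"

# read.delim() already assumes sep = "\t"; only header needs overriding
via_delim <- read.delim(text = tsv_text, header = FALSE)

# read.table() already assumes header = FALSE; only sep needs setting
via_table <- read.table(text = tsv_text, sep = "\t")

# Both calls yield 2 rows and 3 columns named V1, V2, V3
str(via_delim)
str(via_table)
```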

summary

Use summary() to get quick descriptive statistics for a data frame.



In [22]:

    
# The file has no header row, so set header = FALSE
hotdogs <- read.delim('hotdogs.txt', sep='\t', header=FALSE)
head(hotdogs)

V1    V2   V3
Beef  186  495
Beef  181  477
Beef  176  425
Beef  149  322
Beef  184  482
Beef  190  587



In [23]:

    
summary(hotdogs)

       V1           V2              V3       
 Beef   :20   Min.   : 86.0   Min.   :144.0  
 Meat   :17   1st Qu.:132.0   1st Qu.:362.5  
 Poultry:17   Median :145.0   Median :405.0  
              Mean   :145.4   Mean   :424.8  
              3rd Qu.:172.8   3rd Qu.:503.5  
              Max.   :195.0   Max.   :645.0
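
summary() also works on a single numeric vector, and its result can be indexed by name. A small sketch, using the six calorie values shown by head() above as inline data (the variable names are chosen for the example):

```r
# The six calorie values from the head() output, typed in by hand
calories <- c(186, 181, 176, 149, 184, 190)

s <- summary(calories)
names(s)        # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."

# Pull out individual statistics by name
s[["Median"]]   # 182.5
s[["Max."]]     # 190
```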

file.path()

Use file.path() to build a file path from its components.



In [25]:

    
path <- file.path("data", "hotdogs.txt")
path

'data/hotdogs.txt'
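
file.path() accepts any number of components, so nested directories work the same way. The sketch below assumes a hypothetical "raw" subdirectory and guards the import with file.exists(), so that a missing file gives a readable message instead of an error.

```r
# "raw" is a made-up subdirectory used only for illustration
nested <- file.path("data", "raw", "hotdogs.txt")
nested  # "data/raw/hotdogs.txt"

# Only attempt the import when the file is actually there
if (file.exists(nested)) {
  hotdogs_raw <- read.delim(nested, header = FALSE)
} else {
  message("File not found: ", nested)
}
```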



In [26]:

    
# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table(path, 
                      sep = "\t", 
                      col.names = c("type", "calories", "sodium"))

# Call head() on hotdogs
head(hotdogs)

type  calories  sodium
Beef  186       495
Beef  181       477
Beef  176       425
Beef  149       322
Beef  184       482
Beef  190       587

Filtering with which.min and which.max

Find the hot dog with the fewest calories and the one with the most sodium.



In [28]:

    
min.calo <- hotdogs[which.min(hotdogs$calories), ]
min.calo

   type    calories sodium
50 Poultry       86    358



In [29]:

    
max.sodium <- hotdogs[which.max(hotdogs$sodium), ]
max.sodium

   type calories sodium
15 Beef      190    645
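
which.min() and which.max() return the index of the first extreme value only, which matters when several rows tie. A minimal sketch on a small hand-made data frame (the rows are invented for illustration, not taken from hotdogs.txt):

```r
# Invented sample with two rows tied for the lowest calories
dogs <- data.frame(
  type     = c("Beef", "Meat", "Poultry", "Poultry"),
  calories = c(186, 173, 86, 86),
  sodium   = c(495, 458, 358, 310)
)

# Index of the first minimum only
which.min(dogs$calories)                     # 3

# All rows that share the minimum
which(dogs$calories == min(dogs$calories))   # 3 4

# Use the index to select the whole row, as in the cells above
dogs[which.max(dogs$sodium), ]
```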

colClasses

The colClasses argument is a character vector giving the class of each column. If a column's entry in this vector is the string "NULL", that column is skipped and does not appear in the resulting data frame.



In [30]:

    
# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))


# Display structure of hotdogs2
str(hotdogs2)

'data.frame':	54 obs. of  2 variables:
 $ type  : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sodium: num  495 477 425 322 482 587 370 322 479 375 ...
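
The column-skipping behavior can be checked on inline data as well. This is a sketch under the assumption of a two-row, tab-separated sample; the rows and the variable name kept are made up for the example.

```r
# Hypothetical tab-separated sample with three columns
tsv_text <- "Beef\t186\t495\nPoultry\t86\t358"

# "NULL" (as a string) drops the second column during import
kept <- read.delim(text = tsv_text, header = FALSE,
                   col.names = c("type", "calories", "sodium"),
                   colClasses = c("character", "NULL", "numeric"))

names(kept)  # "type" "sodium"
str(kept)
```

Skipping columns at import time avoids reading data that would be discarded afterwards.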


