In [21]:
## Load data
pt<-read.csv("PlantTrait.csv", header=T,row.names=1,sep=";")
length(pt)
In [22]:
head(pt)
This dataset contains 88 variables (incl. several versions for many variables).
We don't need all of them. Let's clean a bit.
In [17]:
## Give proper class to variables
pt$FrLe2<-as.numeric(pt$FrLe2)
pt$FrWi2<-as.numeric(pt$FrWi2)
pt$SeWi2<-as.numeric(pt$SeWi2)
pt$SeLe2<-as.numeric(pt$SeLe2)
## Log-transform data to approach a normal distribution
pt$log_FrLe<-log(pt$FrLe2)
pt$log_FrWi<-log(pt$FrWi2)
pt$log_SeLe<-log(pt$SeLe2)
pt$log_SeWi<-log(pt$SeWi2)
In [18]:
## Keep only endemic species
pt2 <-subset(pt, Origin=="E")
## Create file with variables needed for analyses (+ potential important predictors for imputation)
pt3 <-subset(pt2,select=c(Gen,Fam,Clade2,Lform3, FrType2, FrClass2, Dcol2, FrHusk2, Pulp2, log_FrLe, log_FrWi, SeFr, log_SeLe, log_SeWi, Rarity3, Clim_Humid, Clim_Subhumid, Clim_Subarid, Clim_Dry, Clim_Montane, Veg_Forest, Veg_Bush, Veg_Shrub, Veg_Grass, Veg_Mang, Veg_Anthr, IUCN,FrOd2, SeProt2))
## Give proper class to variables
pt3$Rarity3<-factor(pt3$Rarity3, order = TRUE, levels = c("0", "1", "2", "3"))
pt3$log_FrLe<-as.numeric(pt3$log_FrLe)
pt3$log_FrWi<-as.numeric(pt3$log_FrWi)
pt3$log_SeWi<-as.numeric(pt3$log_SeWi)
pt3$log_SeLe<-as.numeric(pt3$log_SeLe)
pt3$SeFr<-as.numeric(pt3$SeFr)
How many endemic plant species are there in Madagascar?
In [19]:
nrow(pt3)
In [32]:
## Let's have a look at the data
summary(pt3)
It seems that the dataset has many missing data. Let's check that.
In [33]:
## Total % NA in dataset (after having removed some unnecessary variables)
ptx <-subset(pt3,select=-c(Gen, Fam, Clade2, log_FrLe, log_SeLe, Rarity3))
percNA <- sum(is.na(ptx))/(nrow(ptx)*ncol(ptx))*100
percNA
Yes, 37% of missing data. That's rather important. Let's see whether we can decrease a bit the number of missing data.
In [34]:
## check % NA in each variable
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(pt3,2,pMiss)
Some variables have a very high rate of missing values. Let's remove those with more than 90% NA: FrOd2, SeProt2 and IUCN.
Now, do we have species with a high rate of missing values?
In [35]:
# check % NA in observations
pMiss <- function(x){sum(is.na(x))/length(x)*100}
pty <-subset(pt3,select=-c(Gen, Fam, Clade2, log_FrLe, log_SeLe, Rarity3, FrOd2, SeProt2, IUCN))
pt3$percentNA<-apply(pty,1,pMiss)
pt3 <- pt3[order(-pt3$percentNA),]
head(pt3)
There are 4 species without any data (100% NA). Let's remove them.
In [36]:
## Remove species with 100% NA (i.e. no data)
pt4 <-pt3[ ! pt3$percentNA== 100, ]
In [37]:
## Number of species removed
nrow(pt3)-nrow(pt4)
In [38]:
## New number of species
nrow(pt4)
Now, let's check again the percentage of missing data in the dataset.
In [42]:
## Total % NA in dataset (after having removed some unnecessary variables)
ptz <-subset(pt4,select=-c(Gen, Fam, Clade2, log_FrLe, log_SeLe, Rarity3, FrOd2, SeProt2, IUCN))
percNA2 <- sum(is.na(ptz))/(nrow(ptz)*ncol(ptz))*100
percNA2
In [39]:
## Number of species with all the data available (0 NA)
pt_all <-pt4[pt4$percentNA== 0, ]
nrow(pt_all)
Due to the random scattering of missing values across the dataset, a complete case analysis, in which all observations with one or more missing values are omitted, would cover only 281 species (i.e. just 3% of the original 8784 species). This would lead to non-negligible loss of power with misleading results and biased estimates of parameters.
We have to perform multiple imputations of missing values.
In [45]:
library(mice);
pt5 <-read.csv("planttrait2.csv", header=T, sep=";",row.names=1)
In [46]:
## Give proper class to variables
pt5$Rarity3<-factor(pt5$Rarity3, order = TRUE, levels = c("0", "1", "2", "3"))
pt5$log_FrLe<-as.numeric(pt5$log_FrLe)
pt5$log_FrWi<-as.numeric(pt5$log_FrWi)
pt5$log_SeWi<-as.numeric(pt5$log_SeWi)
pt5$log_SeLe<-as.numeric(pt5$log_SeLe)
pt5$SeFr<-as.numeric(pt5$SeFr)
str(pt5)
Legend:Gen=genus; Fam=family; Lform3=life form (HE=herb,LI=liana,SH=shrub,Tree); FrType=fruit type; FrClass2=fruit class (DD=dry dehiscent, DI=dry indehiscent, F=fleshy); Dcol2=diaspore colour; FrHusk2=fruit husk; FrLe=fruit length; FrWi=fruit width; SeFr=number of seeds per fruit; SeLe=seed length; SeWi=seed width;
We will use
NB: for categorical data, 'cart' seems to be better than 'polyreg' and 'polr'(cf. Akande 2015); however, we have some predictors with many levels (family, genus), so it is impossible to use a method based on trees.
In [47]:
## Prepare predictor matrix
new.pred<-quickpred(pt5)
ini <- mice(pt5, maxit=0, pri=F) # create the mids object called ini containing the default settings
meth <- ini$meth
meth[c("Gen","Fam","Clade2","FrClass2","log_FrLe","log_SeLe", "Veg_Mang","FrOd2", "SeProt2", "IUCN")] <- "" # Skip a variable from imputation while this variable will still be used for prediction; Veg_Mang is too unbalanced (proba to have 'no' ~1)
meth[c("Clim_Humid","Clim_Subhumid","Clim_Subarid","Clim_Dry","Clim_Montane","Veg_Forest","Veg_Bush","Veg_Shrub","Veg_Grass","Veg_Anthr")] <- "logreg"
meth[c("Lform3","FrType2","Dcol2","FrHusk2","Pulp2")] <- "polyreg"
meth[c("log_FrWi","SeFr", "log_SeWi")] <- "pmm"
meth["Rarity3"] <- "polr"
In [ ]:
## Impute missing data
imputed_Data <- mice(pt5, MaxNWts = 20000, m=20, pred=new.pred, meth=meth, maxit = 10, seed=123)
summary(imputed_Data)
NB: the code is very long to run (5 days) so the output has been copied-pasted from a previous run in the following cell.
In [ ]:
Multiply imputed data set
Call:
mice(data = pt5, m = 20, method = meth, predictorMatrix = new.pred,
maxit = 10, seed = 123, MaxNWts = 20000)
Number of multiple imputations: 20
Missing cells per column:
Gen Fam Clade2 Lform3 FrType2 FrClass2 Dcol2 FrHusk2 Pulp2 log_FrLe log_FrWi
0 0 0 216 149 295 7102 3755 535 4360 5470
SeFr log_SeLe log_SeWi Rarity3 Clim_Humid Clim_Subhumid Clim_Subarid Clim_Dry Clim_Montane Veg_Forest Veg_Bush
5158 5427 6743 2 1516 1516 1516 1516 1516 2171 2171
Veg_Shrub Veg_Grass Veg_Mang Veg_Anthr IUCN FrOd2 SeProt2
2171 2171 2171 2171 7953 8724 7988
Imputation methods:
Gen Fam Clade2 Lform3 FrType2 FrClass2 Dcol2 FrHusk2 Pulp2 log_FrLe log_FrWi
"" "" "" "polyreg" "polyreg" "" "polyreg" "polyreg" "polyreg" "" "pmm"
SeFr log_SeLe log_SeWi Rarity3 Clim_Humid Clim_Subhumid Clim_Subarid Clim_Dry Clim_Montane Veg_Forest Veg_Bush
"pmm" "" "pmm" "polr" "logreg" "logreg" "logreg" "logreg" "logreg" "logreg" "logreg"
Veg_Shrub Veg_Grass Veg_Mang Veg_Anthr IUCN FrOd2 SeProt2
"logreg" "logreg" "" "logreg" "" "" ""
VisitSequence:
Lform3 FrType2 Dcol2 FrHusk2 Pulp2 log_FrWi SeFr log_SeWi Rarity3 Clim_Humid Clim_Subhumid
4 5 7 8 9 11 12 14 15 16 17
Clim_Subarid Clim_Dry Clim_Montane Veg_Forest Veg_Bush Veg_Shrub Veg_Grass Veg_Mang Veg_Anthr
18 19 20 21 22 23 24 25 26
PredictorMatrix:
Gen Fam Clade2 Lform3 FrType2 FrClass2 Dcol2 FrHusk2 Pulp2 log_FrLe log_FrWi SeFr log_SeLe log_SeWi Rarity3 Clim_Humid Clim_Subhumid Clim_Subarid
Gen 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Fam 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Clade2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Lform3 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0
FrType2 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 1
FrClass2 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 0
Dcol2 1 0 1 1 1 0 0 1 1 0 1 0 0 1 0 1 0 1
FrHusk2 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 0 0
Pulp2 1 1 1 1 1 0 1 1 0 0 1 0 0 1 0 1 1 0
log_FrLe 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 0 0 0
log_FrWi 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 0 1 0
SeFr 1 0 1 1 1 0 0 1 1 0 1 0 0 1 0 0 0 1
log_SeLe 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1
log_SeWi 0 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0
Rarity3 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1
Clim_Humid 0 0 0 1 0 0 1 1 1 0 0 0 0 1 1 0 1 1
Clim_Subhumid 0 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 0 1
Clim_Subarid 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 1 0
Clim_Dry 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0
Clim_Montane 0 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1
Veg_Forest 0 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 0 1
Veg_Bush 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1
Veg_Shrub 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 0
Veg_Grass 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 0
Veg_Mang 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Veg_Anthr 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
IUCN 0 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 1
FrOd2 1 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1
SeProt2 0 1 1 1 1 0 1 1 1 0 1 0 0 1 0 1 0 0
Clim_Dry Clim_Montane Veg_Forest Veg_Bush Veg_Shrub Veg_Grass Veg_Mang Veg_Anthr IUCN FrOd2 SeProt2
Gen 0 0 0 0 0 0 0 0 0 0 0
Fam 0 0 0 0 0 0 0 0 0 0 0
Clade2 0 0 0 0 0 0 0 0 0 0 0
Lform3 0 1 1 0 0 1 0 0 0 0 0
FrType2 1 1 1 0 0 0 0 0 0 0 0
FrClass2 0 0 1 0 0 0 0 0 0 0 0
Dcol2 0 0 1 1 0 0 0 0 0 0 0
FrHusk2 0 0 1 0 0 0 0 0 0 0 0
Pulp2 0 1 1 0 1 1 0 0 0 0 0
log_FrLe 0 0 1 0 0 0 0 0 0 0 0
log_FrWi 1 1 1 0 1 0 0 0 0 0 0
SeFr 0 0 0 0 0 0 0 0 0 0 0
log_SeLe 1 1 1 0 1 0 0 0 0 0 0
log_SeWi 1 1 1 0 1 1 0 0 0 0 0
Rarity3 0 0 0 1 0 0 0 0 0 0 0
Clim_Humid 1 1 1 1 0 1 0 0 0 0 0
Clim_Subhumid 1 1 0 1 1 1 0 0 0 0 0
Clim_Subarid 0 1 1 1 0 0 0 0 0 0 0
Clim_Dry 0 1 0 0 0 0 0 0 0 0 0
Clim_Montane 1 0 1 0 1 1 0 0 0 0 0
Veg_Forest 0 1 0 1 1 1 0 0 0 0 0
Veg_Bush 0 0 1 0 0 0 0 0 0 0 0
Veg_Shrub 0 1 1 0 0 0 0 0 0 0 0
Veg_Grass 0 1 1 0 0 0 0 1 0 0 0
Veg_Mang 0 0 0 0 0 0 0 0 0 0 0
Veg_Anthr 0 0 0 0 0 1 0 0 0 0 0
IUCN 0 0 0 1 0 0 0 0 0 0 0
FrOd2 1 1 1 1 0 1 0 0 0 0 0
SeProt2 0 0 1 0 0 0 0 0 0 0 0
Random generator seed value: 123
In [ ]:
## Save imputed dataset
Datimp <- complete(imputed_Data, "long", include=TRUE)
write.table(Datimp, file="PT-data-mice2.csv", sep=",",row.names=FALSE)