R labs

All the labs, R version. (There is no lab 1.)


Lab 2

Help


In [8]:
# help()
# help(function)
# help(package='package-name')

Packages


In [2]:
# install
# install.packages('package-name')

# already installed with conda
#install.packages("foreign")

# new installs
#install.packages("Rcmdr", dependencies = TRUE, repos="http://cran.rstudio.com/") # in conda?
#install.packages("nortest", repos="http://cran.rstudio.com/")
#install.packages("sas7bdat", repos="http://cran.rstudio.com/")
#install.packages("Hmisc", repos="http://cran.rstudio.com/")
#install.packages("pastecs", repos="http://cran.rstudio.com/")

# import
# library('package-name')

library(foreign)
library(nortest)
library(sas7bdat)
library(Hmisc)
library(pastecs)


Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

    format.pval, round.POSIXt, trunc.POSIXt, units

Loading required package: boot

Attaching package: ‘boot’

The following object is masked from ‘package:survival’:

    aml

The following object is masked from ‘package:lattice’:

    melanoma
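
A minimal sketch for attaching a package only when it is already installed, so the `library()` calls above do not stop the notebook on a machine where something is missing. `load_or_report` is a hypothetical helper written for this note, not part of any package.

```r
# Hypothetical helper: attach a package if it is installed, otherwise report it.
load_or_report <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)
    TRUE
  } else {
    message("Package not installed: ", pkg)
    FALSE
  }
}

# 'stats' ships with R, so this call succeeds on any installation.
ok <- load_or_report("stats")
```

The same call could be mapped over the package names used in this lab, for example with `sapply(c("foreign", "nortest"), load_or_report)`.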

Working space


In [2]:
ls()
# rm(list=ls())
# setwd()
getwd()


Out[2]:
'/home/inrs/EUR8217/labo'

Read data


In [3]:
# import Excel data: export to a tab-separated text file first
#fichierTexte <- read.table("data/labo2/SR_Data.txt", header = TRUE, sep = "\t")

# import DBF (DBase)
fichierDBF <- read.dbf("data/labo2/SR_Data.dbf")

# import SPSS
#fichierSPSS <- read.spss("data/labo2/Data_SPSS.sav", to.data.frame=TRUE)

# import SAS
#fichierSAS <- read.sas7bdat("data/labo2/tableau1.sas7bdat", debug=FALSE)

head(fichierDBF)


Out[3]:
POPTOT_FRFAIBLEREVMONOPCTMENAGE1PCTIMMREC_PCTTX_CHOMNOECOLEPCTSCO_M9PCTSCO_M13PCTPARTIELPCTFAIBREVPCTINDICE_PAUDist_MinN_1000Dist_Moy_3Shape_LengShape_Area
19703511.4816.671.031.886.1624.6645.133.610.496816264.77208835.78620110.137483047
29105296521.7424.225.4310.3630.429.6434.4441.6832.561.492181458.9560.1793352.85412854.562958949
3419043513.9326.593.14.5522.693.7628.8440.9410.380.699961094.8870.3721862.3797010.8691452463
4130033522.9560.360.778.8968.757.2336.633.3325.771.156881155.8350.3481826.476303.374683634.5
56270101015.4721.963.437.5229.314.5933.2245.0816.110.897151097.9450.591652.0415814.0481764655
6434093516.88233.114.61256.6134.0347.621.540.97111705.6721.0751343.4238928.1981105847

Table structure


In [4]:
# show variable names
names(fichierDBF)
# indexes start at 1

# delete variable
fichierDBF$Shape_Leng <- NULL

# rename variable
names(fichierDBF)[1] <- "POPTOT"

# create variables: area in km2 (assuming Shape_Area is in m2) and population density
fichierDBF$km <- fichierDBF$Shape_Area / 1000000
fichierDBF$HabKm2 <- fichierDBF$POPTOT / fichierDBF$km

head(fichierDBF)


Out[4]:
  1. 'POPTOT_FR'
  2. 'FAIBLEREV'
  3. 'MONOPCT'
  4. 'MENAGE1PCT'
  5. 'IMMREC_PCT'
  6. 'TX_CHOM'
  7. 'NOECOLEPCT'
  8. 'SCO_M9PCT'
  9. 'SCO_M13PCT'
  10. 'PARTIELPCT'
  11. 'FAIBREVPCT'
  12. 'INDICE_PAU'
  13. 'Dist_Min'
  14. 'N_1000'
  15. 'Dist_Moy_3'
  16. 'Shape_Leng'
  17. 'Shape_Area'
Out[4]:
POPTOTFAIBLEREVMONOPCTMENAGE1PCTIMMREC_PCTTX_CHOMNOECOLEPCTSCO_M9PCTSCO_M13PCTPARTIELPCTFAIBREVPCTINDICE_PAUDist_MinN_1000Dist_Moy_3Shape_AreakmHabKm2
19703511.4816.671.031.886.1624.6645.133.610.496816264.77208835.78674830477.483047129.6263
29105296521.7424.225.4310.3630.429.6434.4441.6832.561.492181458.9560.1793352.85429589492.9589493077.106
3419043513.9326.593.14.5522.693.7628.8440.9410.380.699961094.8870.3721862.37914524631.4524632884.755
4130033522.9560.360.778.8968.757.2336.633.3325.771.156881155.8350.3481826.47683634.50.68363451901.601
56270101015.4721.963.437.5229.314.5933.2245.0816.110.897151097.9450.591652.04117646551.7646553553.102
6434093516.88233.114.61256.6134.0347.621.540.97111705.6721.0751343.42311058471.1058473924.595

In [5]:
# new table from a subset of columns (positions 12 to 15)
names(fichierDBF)
ZScores <- fichierDBF[, 12:15]
names(ZScores)


Out[5]:
  1. 'POPTOT'
  2. 'FAIBLEREV'
  3. 'MONOPCT'
  4. 'MENAGE1PCT'
  5. 'IMMREC_PCT'
  6. 'TX_CHOM'
  7. 'NOECOLEPCT'
  8. 'SCO_M9PCT'
  9. 'SCO_M13PCT'
  10. 'PARTIELPCT'
  11. 'FAIBREVPCT'
  12. 'INDICE_PAU'
  13. 'Dist_Min'
  14. 'N_1000'
  15. 'Dist_Moy_3'
  16. 'Shape_Area'
  17. 'km'
  18. 'HabKm2'
Out[5]:
  1. 'INDICE_PAU'
  2. 'Dist_Min'
  3. 'N_1000'
  4. 'Dist_Moy_3'
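
Column positions shift whenever a variable is deleted or added, as happened with Shape_Leng above. A small sketch of selecting by name instead, using the built-in `mtcars` data as a stand-in for fichierDBF (the variable names here are not from the lab data):

```r
# Built-in data set used as a stand-in for fichierDBF.
df <- head(mtcars)

# Selecting by name keeps working if columns are added or removed upstream.
wanted <- c("mpg", "hp", "wt")
sub_by_name <- df[, wanted]

# Positional selection depends on the current column order.
sub_by_pos <- df[, c(1, 4, 6)]

identical(sub_by_name, sub_by_pos)
```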

Normality


In [17]:
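# ks.test() requires a second argument y (a reference distribution such as "pnorm",
# or a second sample); without it the call fails with the error below
#ks.test(fichierDBF[18:20])


Error in FUN(X[[i]], ...): argument "y" is missing, with no default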
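
A hedged sketch of a one-sample Kolmogorov-Smirnov call that supplies the missing `y`. The data are simulated here, standing in for a single column such as HabKm2.

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # simulated stand-in for one column

# One-sample K-S test against a normal with the sample mean and sd.
res <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
res$p.value
```

Because the mean and standard deviation are estimated from the same sample, the plain K-S p-value tends to be conservative. That is the case the Lilliefors correction (`lillie.test` from nortest) is meant for.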

In [14]:
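# columns 18:20 are HabKm2, SqrtDens and SqrtImg; the last two are created in the Transformations cells below
sapply(fichierDBF[18:20],lillie.test)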


Out[14]:
           HabKm2        SqrtDens    SqrtImg
statistic  0.09465468    0.02742644  0.07141727
p.value    1.051956e-11  0.469466    1.823375e-06
method     Lilliefors (Kolmogorov-Smirnov) normality test (same for all three)
data.name  X[[i]]        X[[i]]      X[[i]]

Shapiro-Wilk


In [15]:
sapply(fichierDBF[18:20],shapiro.test)


Out[15]:
           HabKm2        SqrtDens     SqrtImg
statistic  0.902345      0.990485     0.9697826
p.value    1.811718e-17  0.002371701  1.052118e-08
method     Shapiro-Wilk normality test (same for all three)
data.name  X[[i]]        X[[i]]       X[[i]]
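
`sapply()` over a test function returns the whole test object for each column. A possible alternative, sketched on simulated data (the column names are made up), keeps only the p-values:

```r
set.seed(42)
# Simulated stand-in: one roughly normal and one right-skewed column.
df <- data.frame(normal = rnorm(300), skewed = rexp(300))

# vapply returns a plain named numeric vector of p-values.
p_values <- vapply(df, function(v) shapiro.test(v)$p.value, numeric(1))
p_values < 0.05   # TRUE where normality is rejected at the 5% level
```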

Transformations

Square root


In [6]:
fichierDBF$SqrtDens <- sqrt(fichierDBF$HabKm2)
fichierDBF$SqrtImg <- sqrt(fichierDBF$IMMREC_PCT)

Logarithmic


In [7]:
# log(0) is -Inf (not an error), so add 1 to variables that contain zeros (IMMREC_PCT has a minimum of 0)
fichierDBF$LogDens <- log(fichierDBF$HabKm2)
fichierDBF$LogImg <- log(fichierDBF$IMMREC_PCT+1)

summary(fichierDBF)


Out[7]:
     POPTOT       FAIBLEREV         MONOPCT        MENAGE1PCT   
 Min.   : 245   Min.   :  10.0   Min.   : 0.00   Min.   : 3.94  
 1st Qu.:2241   1st Qu.: 521.2   1st Qu.:16.05   1st Qu.:28.59  
 Median :3328   Median : 900.0   Median :21.23   Median :38.60  
 Mean   :3500   Mean   :1015.3   Mean   :21.38   Mean   :37.67  
 3rd Qu.:4544   3rd Qu.:1330.0   3rd Qu.:26.18   3rd Qu.:46.76  
 Max.   :9105   Max.   :4195.0   Max.   :51.28   Max.   :72.63  
   IMMREC_PCT        TX_CHOM         NOECOLEPCT      SCO_M9PCT    
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 2.112   1st Qu.: 6.593   1st Qu.:24.52   1st Qu.: 7.56  
 Median : 3.850   Median : 8.555   Median :32.66   Median :14.23  
 Mean   : 5.199   Mean   : 9.456   Mean   :32.66   Mean   :14.63  
 3rd Qu.: 6.473   3rd Qu.:11.670   3rd Qu.:40.93   3rd Qu.:20.97  
 Max.   :25.790   Max.   :47.440   Max.   :68.75   Max.   :37.05  
   SCO_M13PCT      PARTIELPCT      FAIBREVPCT      INDICE_PAU    
 Min.   : 5.10   Min.   :30.65   Min.   : 1.23   Min.   :0.1743  
 1st Qu.:27.93   1st Qu.:41.26   1st Qu.:19.76   1st Qu.:1.1573  
 Median :40.26   Median :45.47   Median :28.70   Median :1.5480  
 Mean   :39.42   Mean   :45.61   Mean   :29.98   Mean   :1.5608  
 3rd Qu.:52.14   3rd Qu.:49.65   3rd Qu.:39.80   3rd Qu.:1.9303  
 Max.   :70.49   Max.   :69.79   Max.   :82.64   Max.   :3.8956  
    Dist_Min          N_1000         Dist_Moy_3       Shape_Area      
 Min.   : 182.5   Min.   :0.0000   Min.   : 422.3   Min.   :   38221  
 1st Qu.: 534.7   1st Qu.:0.4778   1st Qu.:1013.7   1st Qu.:  238109  
 Median : 728.9   Median :1.0000   Median :1262.8   Median :  482166  
 Mean   : 909.7   Mean   :1.2198   Mean   :1489.5   Mean   :  962204  
 3rd Qu.:1049.5   3rd Qu.:1.8077   3rd Qu.:1634.3   3rd Qu.:  936821  
 Max.   :6389.7   Max.   :5.5640   Max.   :8835.8   Max.   :28875026  
       km               HabKm2         SqrtDens         SqrtImg     
 Min.   : 0.03822   Min.   :  124   Min.   : 11.13   Min.   :0.000  
 1st Qu.: 0.23811   1st Qu.: 3859   1st Qu.: 62.12   1st Qu.:1.453  
 Median : 0.48217   Median : 6902   Median : 83.08   Median :1.962  
 Mean   : 0.96220   Mean   : 7996   Mean   : 84.08   Mean   :2.074  
 3rd Qu.: 0.93682   3rd Qu.:11371   3rd Qu.:106.63   3rd Qu.:2.544  
 Max.   :28.87503   Max.   :44777   Max.   :211.61   Max.   :5.078  
    LogDens           LogImg     
 Min.   : 4.820   Min.   :0.000  
 1st Qu.: 8.258   1st Qu.:1.135  
 Median : 8.840   Median :1.579  
 Mean   : 8.700   Mean   :1.588  
 3rd Qu.: 9.339   3rd Qu.:2.011  
 Max.   :10.709   Max.   :3.288  
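
A short illustration of what happens with zeros under a log transform, on a toy vector rather than the lab data:

```r
x <- c(0, 1, 9, 99)

log(x)       # first element is -Inf, not an error
log(x + 1)   # shifting by 1 keeps zeros finite (0 maps to 0)
log1p(x)     # built-in equivalent of log(x + 1), more accurate near 0
```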

Centering and scaling (z-scores)


In [9]:
# standardize each ZScores column in place (mean 0, standard deviation 1)
ZScores$INDICE_PAU <- scale(ZScores$INDICE_PAU, center = TRUE, scale = TRUE)
ZScores$Dist_Min <- scale(ZScores$Dist_Min, center = TRUE, scale = TRUE)
ZScores$N_1000 <- scale(ZScores$N_1000, center = TRUE, scale = TRUE)
ZScores$Dist_Moy_3 <- scale(ZScores$Dist_Moy_3, center = TRUE, scale = TRUE)

#help(sapply)
sapply(ZScores,mean)
sapply(ZScores,sd)


Out[9]:
INDICE_PAU
2.69285098798269e-17
Dist_Min
5.71711215386722e-17
N_1000
-1.9236874870272e-16
Dist_Moy_3
-2.03659664410056e-16
Out[9]:
INDICE_PAU
1
Dist_Min
1
N_1000
1
Dist_Moy_3
1
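
For reference, a sketch of what `scale()` computes, checked against the manual formula on a built-in variable (`mtcars$mpg` is only a stand-in):

```r
x <- mtcars$mpg                       # built-in stand-in variable

z_manual <- (x - mean(x)) / sd(x)     # centre, then divide by the sample sd
z_scale  <- as.numeric(scale(x))      # scale() returns a one-column matrix

all.equal(z_manual, z_scale)
```

The means shown above are around 1e-17 rather than exactly 0 because of floating-point rounding.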

Descriptive statistics


In [8]:
summary(fichierDBF)


Out[8]:
     POPTOT       FAIBLEREV         MONOPCT        MENAGE1PCT   
 Min.   : 245   Min.   :  10.0   Min.   : 0.00   Min.   : 3.94  
 1st Qu.:2241   1st Qu.: 521.2   1st Qu.:16.05   1st Qu.:28.59  
 Median :3328   Median : 900.0   Median :21.23   Median :38.60  
 Mean   :3500   Mean   :1015.3   Mean   :21.38   Mean   :37.67  
 3rd Qu.:4544   3rd Qu.:1330.0   3rd Qu.:26.18   3rd Qu.:46.76  
 Max.   :9105   Max.   :4195.0   Max.   :51.28   Max.   :72.63  
   IMMREC_PCT        TX_CHOM         NOECOLEPCT      SCO_M9PCT    
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 2.112   1st Qu.: 6.593   1st Qu.:24.52   1st Qu.: 7.56  
 Median : 3.850   Median : 8.555   Median :32.66   Median :14.23  
 Mean   : 5.199   Mean   : 9.456   Mean   :32.66   Mean   :14.63  
 3rd Qu.: 6.473   3rd Qu.:11.670   3rd Qu.:40.93   3rd Qu.:20.97  
 Max.   :25.790   Max.   :47.440   Max.   :68.75   Max.   :37.05  
   SCO_M13PCT      PARTIELPCT      FAIBREVPCT      INDICE_PAU    
 Min.   : 5.10   Min.   :30.65   Min.   : 1.23   Min.   :0.1743  
 1st Qu.:27.93   1st Qu.:41.26   1st Qu.:19.76   1st Qu.:1.1573  
 Median :40.26   Median :45.47   Median :28.70   Median :1.5480  
 Mean   :39.42   Mean   :45.61   Mean   :29.98   Mean   :1.5608  
 3rd Qu.:52.14   3rd Qu.:49.65   3rd Qu.:39.80   3rd Qu.:1.9303  
 Max.   :70.49   Max.   :69.79   Max.   :82.64   Max.   :3.8956  
    Dist_Min          N_1000         Dist_Moy_3       Shape_Area      
 Min.   : 182.5   Min.   :0.0000   Min.   : 422.3   Min.   :   38221  
 1st Qu.: 534.7   1st Qu.:0.4778   1st Qu.:1013.7   1st Qu.:  238109  
 Median : 728.9   Median :1.0000   Median :1262.8   Median :  482166  
 Mean   : 909.7   Mean   :1.2198   Mean   :1489.5   Mean   :  962204  
 3rd Qu.:1049.5   3rd Qu.:1.8077   3rd Qu.:1634.3   3rd Qu.:  936821  
 Max.   :6389.7   Max.   :5.5640   Max.   :8835.8   Max.   :28875026  
       km               HabKm2         SqrtDens         SqrtImg     
 Min.   : 0.03822   Min.   :  124   Min.   : 11.13   Min.   :0.000  
 1st Qu.: 0.23811   1st Qu.: 3859   1st Qu.: 62.12   1st Qu.:1.453  
 Median : 0.48217   Median : 6902   Median : 83.08   Median :1.962  
 Mean   : 0.96220   Mean   : 7996   Mean   : 84.08   Mean   :2.074  
 3rd Qu.: 0.93682   3rd Qu.:11371   3rd Qu.:106.63   3rd Qu.:2.544  
 Max.   :28.87503   Max.   :44777   Max.   :211.61   Max.   :5.078  
    LogDens           LogImg     
 Min.   : 4.820   Min.   :0.000  
 1st Qu.: 8.258   1st Qu.:1.135  
 Median : 8.840   Median :1.579  
 Mean   : 8.700   Mean   :1.588  
 3rd Qu.: 9.339   3rd Qu.:2.011  
 Max.   :10.709   Max.   :3.288  

In [14]:
sapply(fichierDBF, mean)
sapply(fichierDBF, sd)
sapply(fichierDBF, min)
sapply(fichierDBF, max)
sapply(fichierDBF, median)
sapply(fichierDBF, range)
sapply(fichierDBF, quantile)


Out[14]:
POPTOT
245
FAIBLEREV
10
MONOPCT
0
MENAGE1PCT
3.94
IMMREC_PCT
0
TX_CHOM
0
NOECOLEPCT
0
SCO_M9PCT
0
SCO_M13PCT
5.1
PARTIELPCT
30.65
FAIBREVPCT
1.23
INDICE_PAU
0.17429
Dist_Min
182.54
N_1000
0
Dist_Moy_3
422.322
Shape_Area
38220.592566
km
0.038220592566
HabKm2
123.982571312476
SqrtDens
11.1347461269881
SqrtImg
0
LogDens
4.82014100179476
LogImg
0
Out[14]:
POPTOT
9105
FAIBLEREV
4195
MONOPCT
51.28
MENAGE1PCT
72.63
IMMREC_PCT
25.79
TX_CHOM
47.44
NOECOLEPCT
68.75
SCO_M9PCT
37.05
SCO_M13PCT
70.49
PARTIELPCT
69.79
FAIBREVPCT
82.64
INDICE_PAU
3.89559
Dist_Min
6389.749
N_1000
5.564
Dist_Moy_3
8835.786
Shape_Area
28875026.2404
km
28.8750262404
HabKm2
44776.9715443951
SqrtDens
211.605698279595
SqrtImg
5.07838557023785
LogDens
10.7094492582185
LogImg
3.28802868355652
Out[14]:
      POPTOT    FAIBLEREV      MONOPCT   MENAGE1PCT
2.450000e+02 1.000000e+01 0.000000e+00 3.940000e+00
9.105000e+03 4.195000e+03 5.128000e+01 7.263000e+01
  IMMREC_PCT      TX_CHOM   NOECOLEPCT    SCO_M9PCT
0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2.579000e+01 4.744000e+01 6.875000e+01 3.705000e+01
  SCO_M13PCT   PARTIELPCT   FAIBREVPCT   INDICE_PAU
5.100000e+00 3.065000e+01 1.230000e+00 1.742900e-01
7.049000e+01 6.979000e+01 8.264000e+01 3.895590e+00
    Dist_Min       N_1000   Dist_Moy_3   Shape_Area
1.825400e+02 0.000000e+00 4.223220e+02 3.822059e+04
6.389749e+03 5.564000e+00 8.835786e+03 2.887503e+07
          km       HabKm2     SqrtDens      SqrtImg
3.822059e-02 1.239826e+02 1.113475e+01 0.000000e+00
2.887503e+01 4.477697e+04 2.116057e+02 5.078386e+00
     LogDens       LogImg
4.820141e+00 0.000000e+00
1.070945e+01 3.288029e+00
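
Several statistics can also be collected in one call instead of one `sapply()` per statistic. A sketch on built-in data, where `basic_stats` is a hypothetical helper defined only for this example:

```r
df <- mtcars[, c("mpg", "hp", "wt")]  # built-in stand-in for fichierDBF

# Hypothetical helper returning a named vector of statistics for one column.
basic_stats <- function(v) {
  c(mean = mean(v), sd = sd(v), min = min(v), median = median(v), max = max(v))
}

stats_table <- sapply(df, basic_stats)  # rows = statistics, columns = variables
round(stats_table, 2)
```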

In [15]:
# describe() from the Hmisc package
describe(fichierDBF)


Out[15]:
fichierDBF 

 22  Variables      506  Observations
--------------------------------------------------------------------------------
POPTOT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     403       1    3500    1225    1600    2241    3328    4544 
    .90     .95 
   5905    6492 

lowest :  245  280  405  415  420, highest: 7435 7485 7605 8240 9105 
--------------------------------------------------------------------------------
FAIBLEREV 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     288       1    1015   216.2   310.0   521.2   900.0  1330.0 
    .90     .95 
 1867.5  2283.8 

lowest :   10   25   35   60   70, highest: 3225 3320 3360 3505 4195 
--------------------------------------------------------------------------------
MONOPCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     444       1   21.38   8.793  11.365  16.053  21.225  26.185 
    .90     .95 
 31.230  34.517 

lowest :  0.00  3.59  4.65  5.54  6.07, highest: 42.25 43.04 45.26 49.40 51.28 
--------------------------------------------------------------------------------
MENAGE1PCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     475       1   37.67   15.25   19.41   28.59   38.60   46.76 
    .90     .95 
  53.19   56.93 

lowest :  3.94  5.20  5.79  6.47  7.98, highest: 68.12 68.57 69.37 71.75 72.63 
--------------------------------------------------------------------------------
IMMREC_PCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     396       1   5.199  0.5475  1.0450  2.1125  3.8500  6.4725 
    .90     .95 
10.7000 15.3425 

lowest :  0.00  0.16  0.24  0.34  0.35, highest: 22.69 23.02 23.50 24.96 25.79 
--------------------------------------------------------------------------------
TX_CHOM 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     402       1   9.456   3.973   4.950   6.592   8.555  11.670 
    .90     .95 
 14.960  16.970 

lowest :  0.00  0.95  1.80  1.88  2.31, highest: 25.83 28.03 28.70 30.40 47.44 
--------------------------------------------------------------------------------
NOECOLEPCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     430       1   32.66   12.64   16.11   24.52   32.66   40.93 
    .90     .95 
  48.23   52.99 

lowest :  0.00  6.67  7.09  7.56  7.81, highest: 62.50 65.00 65.67 66.67 68.75 
--------------------------------------------------------------------------------
SCO_M9PCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     467       1   14.63   1.980   3.325   7.560  14.230  20.968 
    .90     .95 
 26.200  28.343 

lowest :  0.00  0.57  0.67  0.71  0.72, highest: 34.43 34.48 35.42 35.68 37.05 
--------------------------------------------------------------------------------
SCO_M13PCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     481       1   39.42   13.65   16.74   27.93   40.26   52.14 
    .90     .95 
  58.84   61.25 

lowest :  5.10  6.00  6.95  7.76  7.80, highest: 66.98 68.17 70.30 70.48 70.49 
--------------------------------------------------------------------------------
PARTIELPCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     463       1   45.61   35.91   37.55   41.25   45.47   49.65 
    .90     .95 
  53.66   56.98 

lowest : 30.65 30.75 30.99 31.40 31.60, highest: 65.22 67.00 67.65 68.57 69.79 
--------------------------------------------------------------------------------
FAIBREVPCT 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     475       1   29.98   8.178  11.565  19.763  28.700  39.800 
    .90     .95 
 49.755  53.545 

lowest :  1.23  2.38  3.32  3.47  3.61, highest: 62.78 66.12 74.49 81.48 82.64 
--------------------------------------------------------------------------------
INDICE_PAU 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1   1.561  0.5949  0.7546  1.1573  1.5480  1.9303 
    .90     .95 
 2.3171  2.6706 

lowest : 0.1743 0.2353 0.3180 0.3570 0.3859
highest: 3.1175 3.1740 3.2172 3.4252 3.8956 
--------------------------------------------------------------------------------
Dist_Min 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1   909.7   334.6   379.5   534.7   728.9  1049.5 
    .90     .95 
 1569.0  2016.0 

lowest :  182.5  201.2  216.3  230.9  242.2
highest: 3841.6 3905.5 4161.8 6264.8 6389.7 
--------------------------------------------------------------------------------
N_1000 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     405       1    1.22  0.0000  0.0265  0.4778  1.0000  1.8077 
    .90     .95 
 2.6680  3.0715 

lowest : 0.000 0.001 0.014 0.017 0.023, highest: 3.988 3.998 4.176 5.112 5.564 
--------------------------------------------------------------------------------
Dist_Moy_3 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1    1490   724.2   801.8  1013.7  1262.8  1634.3 
    .90     .95 
 2420.7  2928.6 

lowest :  422.3  535.6  577.4  601.8  608.1
highest: 4932.1 5037.8 7535.4 7820.8 8835.8 
--------------------------------------------------------------------------------
Shape_Area 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1  962204  118864  146999  238109  482166  936821 
    .90     .95 
1746212 2819107 

lowest :    38221    69707    71072    87174    90141
highest: 10145933 12484527 13379330 15415882 28875026 
--------------------------------------------------------------------------------
km 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1  0.9622  0.1189  0.1470  0.2381  0.4822  0.9368 
    .90     .95 
 1.7462  2.8191 

lowest :  0.03822  0.06971  0.07107  0.08717  0.09014
highest: 10.14593 12.48453 13.37933 15.41588 28.87503 
--------------------------------------------------------------------------------
HabKm2 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1    7996    1136    2103    3859    6902   11371 
    .90     .95 
  15141   16839 

lowest :   124.0   129.6   204.2   217.5   228.3
highest: 23527.2 30830.1 31648.4 37676.0 44777.0 
--------------------------------------------------------------------------------
SqrtDens 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1   84.08   33.71   45.86   62.12   83.08  106.63 
    .90     .95 
 123.05  129.76 

lowest :  11.13  11.39  14.29  14.75  15.11
highest: 153.39 175.59 177.90 194.10 211.61 
--------------------------------------------------------------------------------
SqrtImg 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     396       1   2.074  0.7399  1.0222  1.4534  1.9621  2.5441 
    .90     .95 
 3.2711  3.9169 

lowest : 0.0000 0.4000 0.4899 0.5831 0.5916
highest: 4.7634 4.7979 4.8477 4.9960 5.0784 
--------------------------------------------------------------------------------
LogDens 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     506       1     8.7   7.035   7.651   8.258   8.840   9.339 
    .90     .95 
  9.625   9.731 

lowest :  4.820  4.865  5.319  5.382  5.431
highest: 10.066 10.336 10.362 10.537 10.709 
--------------------------------------------------------------------------------
LogImg 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
    506       0     396       1   1.588  0.4366  0.7154  1.1354  1.5790  2.0112 
    .90     .95 
 2.4596  2.7937 

lowest : 0.0000 0.1484 0.2151 0.2927 0.3001
highest: 3.1651 3.1789 3.1987 3.2566 3.2880 
--------------------------------------------------------------------------------

In [13]:
# stat.desc() from the pastecs package
stat.desc(fichierDBF, basic=TRUE, norm=TRUE)


Out[13]:
POPTOTFAIBLEREVMONOPCTMENAGE1PCTIMMREC_PCTTX_CHOMNOECOLEPCTSCO_M9PCTSCO_M13PCTPARTIELPCTFAIBREVPCTINDICE_PAUDist_MinN_1000Dist_Moy_3Shape_AreakmHabKm2SqrtDensSqrtImgLogDensLogImg
nbr.val506506506506506506506506506506506506506506506506506506506506506506
nbr.null00301212200000450000012012
nbr.na0000000000000000000000
min2451003.9400005.130.651.230.17429182.540422.32238220.590.03822059123.982611.1347504.8201410
max9105419551.2872.6325.7947.4468.7537.0570.4969.7982.643.895596389.7495.5648835.7862887502628.8750344776.97211.60575.07838610.709453.288029
range8860418551.2868.6925.7947.4468.7537.0565.3939.1481.413.72136207.2095.5648413.4642883680628.8368144652.99200.4715.0783865.8893083.288029
sum177077051376510820.119063.082630.914784.5316523.967405.1319948.5523080.9515170.67789.7804460306.1617.222753699.2486875078486.8751404608542542.111049.614402.029803.5303
median3327.590021.22538.63.858.55532.65514.2340.2645.46528.71.547965728.89311262.812482166.40.48216646901.8883.077561.962148.8395491.578977
mean3499.5451015.34621.383637.674075.1994279.45559332.6560514.6346439.4240145.6145329.981561.560831909.69581.2198061489.524962203.70.96220377996.21684.075322.0743288.6996631.588005
SE.mean72.0696129.972350.34867810.57521520.20798460.2021010.54202690.37590260.6790740.28966180.63446230.0269380629.67610.0438170638.2915888487.460.08848746243.13571.3552670.042135840.039291480.03061528
CI.mean.0.95141.593258.885860.68503841.130110.40862170.39706241.0649050.73852561.3341580.56909061.2465110.0529244658.303820.0860861875.23042173848.90.1738489477.6822.6626550.082783130.077194890.060149
var262817945456161.51768167.421521.8883520.66748148.659371.49921233.337642.4554203.68640.3671834445619.50.9714869741920.13.961996e+123.96199629912167929.39420.89836710.7811730.4742713
std.dev1621.166674.21147.84332112.939154.6784984.5461512.192598.45572115.275396.51578114.271880.6059566667.54740.9856404861.347819904761.9904765469.20230.485970.94782230.88383990.6886736
coef.var0.46325040.66402140.36679150.34344960.89981040.48078960.3733640.57778790.38746420.14284440.47602180.38822690.73381390.80803020.57827052.0686642.0686640.68397370.36260310.45692990.10159470.4336723
skewness0.45801961.2973840.3164988-0.19440951.8780872.059130.13097550.2535259-0.20828190.43784730.35560990.29239753.5059320.95053163.601247.9652187.9652181.5377120.16952920.6045602-1.348520.02862938
skew.2SE2.1092875.9747561.45755-0.89530128.6490299.4827730.60317251.167546-0.95918662.0163891.6376661.34655916.145634.37741916.5845536.6816836.681687.0815220.78072132.78414-6.2102480.1318449
kurtosis-0.28090462.1527290.505375-0.36694653.83191310.7871-0.1434156-0.7592333-0.96900130.534639-0.14501990.0698956819.518310.781791521.0938887.5404987.540495.6186790.32459040.65044752.784922-0.1660773
kurt.2SE-0.64807494.9665611.165951-0.84658218.84060624.88691-0.3308741-1.751627-2.2355831.233465-0.33457550.161256345.030681.80367148.66567201.9646201.964612.962850.74886251.5006476.425091-0.3831569
normtest.W0.97948470.90809070.99219710.99049180.8126530.88040740.99567410.9748390.97056790.98735550.98333870.99119840.69996480.92642220.70352110.35716210.35716210.9023450.9904850.96978260.9107340.993155
normtest.p1.492635e-066.229604e-170.009466510.0023845528.304258e-242.508637e-190.17667151.216573e-071.51341e-080.00022586331.513879e-050.0041881454.785911e-294.832191e-156.598409e-291.074609e-381.074609e-381.811718e-170.0023717011.052118e-081.120286e-160.02108595

Histograms


In [16]:
hist(fichierDBF$HabKm2, main="Histogramme", xlab="Habitants au km2", ylab="Effectif", breaks=10, col='lightblue')



In [17]:
hist(fichierDBF$SqrtDens, main="Histogramme", xlab="Habitants au km2 (racine)", ylab="Effectif", breaks=10, col='gold')



In [18]:
hist(fichierDBF$LogDens, main="Histogramme", xlab="Habitants au km2 (log)", ylab="Effectif", breaks=10, col='coral')


Histogram with normal curve


In [20]:
x <- fichierDBF$HabKm2
h<-hist(x, breaks=10, col="lightblue", xlab="Habitants au km2", ylab="Effectif", 
main="Histogramme avec courbe normale")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)



In [21]:
x <- fichierDBF$SqrtDens
h<-hist(x, breaks=10, col="red", xlab="Habitants au km2 (racine)", ylab = "Effectif",
main="Histogramme avec courbe normale")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)



Lab 3


In [3]:
# install
#install.packages('doBy', repos="http://cran.rstudio.com/")
#install.packages('gmodels', repos="http://cran.rstudio.com/")
#install.packages('scatterplot3d', repos="http://cran.rstudio.com/")

# import
library(foreign)
library(nortest)
library(sas7bdat)
library(Hmisc)
library(pastecs)
library(ggplot2)
library(doBy)
library(gmodels)
library(scatterplot3d)

# data
Tableau1 <- read.sas7bdat("data/labo3/tableau1.sas7bdat", debug=FALSE)
names(Tableau1)

TableauKhi2 <- read.sas7bdat("data/labo3/khi2.sas7bdat", debug=FALSE)
names(TableauKhi2)


Out[3]:
  1. 'POPTOT_FR'
  2. 'FAIBLEREV'
  3. 'MONOPCT'
  4. 'MENAGE1PCT'
  5. 'IMMREC_PCT'
  6. 'TX_CHOM'
  7. 'NOECOLEPCT'
  8. 'SCO_M9PCT'
  9. 'SCO_M13PCT'
  10. 'PARTIELPCT'
  11. 'FAIBREVPCT'
  12. 'INDICE_PAU'
  13. 'Dist_Min'
  14. 'N_1000'
  15. 'Dist_Moy_3'
  16. 'Km2'
  17. 'HabKm2'
  18. 'SqrtDens'
  19. 'LogDens'
  20. 'SqrtImg'
  21. 'LogImg'
  22. 'Id'
Out[3]:
  1. 'SEX'
  2. 'DIST'
  3. 'Mode'

Basic histograms


In [4]:
hist(Tableau1$IMMREC_PCT, breaks=10, xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme")


breaks = suggested number of bars


In [5]:
hist(Tableau1$IMMREC_PCT, breaks=20, xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme")
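
When `breaks` is a single number, `hist()` treats it as a suggestion and picks rounded break points, so the number of bars can differ from the value given. A sketch on simulated data, assuming values in [0, 1]:

```r
set.seed(3)
x <- runif(500)                      # simulated values in [0, 1]

# A single number is only a suggestion; the resulting bin count may differ.
h_suggested <- hist(x, breaks = 7, plot = FALSE)
length(h_suggested$counts)

# An explicit vector of break points fixes the bins exactly: 10 bins here.
h_exact <- hist(x, breaks = seq(0, 1, length.out = 11), plot = FALSE)
length(h_exact$counts)
```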


density = density of the shading lines drawn on the bars (hatching)


In [6]:
hist(Tableau1$IMMREC_PCT, density=20, breaks=20, xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme")


col = colours


In [7]:
hist(Tableau1$IMMREC_PCT, breaks=20, col="red", xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme") 
hist(Tableau1$IMMREC_PCT, breaks=20, col="lightyellow", xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme") 
hist(Tableau1$IMMREC_PCT, breaks=20, col="lightsalmon", xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme") 
hist(Tableau1$IMMREC_PCT, breaks=20, col="lightgreen", xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme")


ylim = y-axis limits


In [10]:
# plot=FALSE computes the histogram without drawing it, so only plot() draws
plot(
    hist(Tableau1$IMMREC_PCT, breaks=20, plot=FALSE),
    ylim=c(0, 80), col="lightgreen", xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme"
)


prob = density (bar areas sum to 1) instead of count


In [11]:
hist(Tableau1$IMMREC_PCT, col="lightgray", breaks=20, xlab="Immigrants récents (%)", ylab = "Proportion", main="Histogramme", prob=TRUE)
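
With `prob=TRUE` the y axis is a density, not a share per bar: it is the bar areas that sum to 1. A quick check on simulated data:

```r
set.seed(7)
x <- rexp(400, rate = 0.2)            # simulated skewed data

h <- hist(x, breaks = 20, plot = FALSE)

sum(h$counts)                         # total count, equals length(x)
sum(h$density * diff(h$breaks))       # bar areas on the density scale sum to 1
```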


Histograms with a normal curve

y = density


In [13]:
m <- mean(Tableau1$IMMREC_PCT)
std <- sd(Tableau1$IMMREC_PCT)
hist(Tableau1$IMMREC_PCT, col="lightyellow", breaks=20, prob=TRUE, xlab="Immigrants récents (%)", ylab = "Proportion", main="Histogramme avec la courbe normale")
curve(dnorm(x, mean=m, sd=std), col="darkblue", lwd=2, add=TRUE)


y = count


In [14]:
x <- Tableau1$IMMREC_PCT
h<-hist(x, breaks=20, col="lightyellow", xlab="Immigrants récents (%)", ylab = "Effectif", main="Histogramme avec la courbe normale") 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="darkblue", lwd=2)
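
The line `yfit*diff(h$mids[1:2])*length(x)` rescales the normal density to the count axis: the expected count in a bin is roughly density × bin width × n. A sketch on simulated data, assuming equal-width bins:

```r
set.seed(11)
x <- rnorm(1000, mean = 5, sd = 2)    # simulated stand-in variable

h <- hist(x, breaks = 20, plot = FALSE)
bin_width <- diff(h$breaks[1:2])      # equal-width bins assumed

# Expected count in a bin ~ density at the midpoint * bin width * n.
expected <- dnorm(h$mids, mean(x), sd(x)) * bin_width * length(x)

# For normal data the expected and observed totals are close.
c(observed = sum(h$counts), expected = sum(expected))
```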


Scatter plots


In [15]:
plot(Tableau1$IMMREC_PCT, Tableau1$FAIBREVPCT, xlab="Immigrants récents (%)", ylab = "Faible revenu (%)", main="Nuage de points")


Scatter plot with regression line


In [16]:
plot(Tableau1$IMMREC_PCT, Tableau1$FAIBREVPCT, xlab="Immigrants récents (%)", ylab = "Faible revenu (%)", main="Nuage de points avec droite de régression")
abline(lsfit(Tableau1$IMMREC_PCT, Tableau1$FAIBREVPCT))
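
`lsfit()` and `lm()` fit the same least-squares line, so either can be passed to `abline()`. A check on the built-in `cars` data (a stand-in for Tableau1):

```r
x <- cars$speed                       # built-in data as a stand-in
y <- cars$dist

coef_lsfit <- unname(lsfit(x, y)$coefficients)
coef_lm    <- unname(coef(lm(y ~ x)))

all.equal(coef_lsfit, coef_lm)        # same intercept and slope
```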


Scatter plot matrix


In [17]:
pairs(~MONOPCT+MENAGE1PCT+TX_CHOM+FAIBREVPCT,data=Tableau1, 
      main="Matrice de nuages de points")


3D scatter plot


In [18]:
scatterplot3d(Tableau1$MONOPCT, Tableau1$TX_CHOM, Tableau1$FAIBREVPCT, main="Nuage de points 3D")
scatterplot3d(Tableau1$MONOPCT, Tableau1$TX_CHOM, Tableau1$FAIBREVPCT, main="Nuage de points 3D", xlab="Familles monoparentales (%)", ylab="Taux de chômage", zlab="Faible revenu (%)");


Correlation matrix

Pearson


In [19]:
rcorr(cbind(Tableau1$MONOPCT,Tableau1$MENAGE1PCT,Tableau1$TX_CHOM,Tableau1$FAIBREVPCT,Tableau1$Dist_Min,Tableau1$N_1000), type="pearson")


Out[19]:
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,]  1.00  0.27  0.50  0.68 -0.36  0.29
[2,]  0.27  1.00  0.27  0.50 -0.46  0.45
[3,]  0.50  0.27  1.00  0.76 -0.30  0.19
[4,]  0.68  0.50  0.76  1.00 -0.44  0.33
[5,] -0.36 -0.46 -0.30 -0.44  1.00 -0.61
[6,]  0.29  0.45  0.19  0.33 -0.61  1.00

n= 506 


P
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]       0    0    0    0    0  
[2,]  0         0    0    0    0  
[3,]  0    0         0    0    0  
[4,]  0    0    0         0    0  
[5,]  0    0    0    0         0  
[6,]  0    0    0    0    0       

Spearman


In [20]:
rcorr(cbind(Tableau1$MONOPCT,Tableau1$MENAGE1PCT,Tableau1$TX_CHOM,Tableau1$FAIBREVPCT,Tableau1$Dist_Min,Tableau1$N_1000), type="spearman")


Out[20]:
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,]  1.00  0.28  0.53  0.66 -0.37  0.35
[2,]  0.28  1.00  0.30  0.49 -0.47  0.48
[3,]  0.53  0.30  1.00  0.81 -0.35  0.31
[4,]  0.66  0.49  0.81  1.00 -0.44  0.42
[5,] -0.37 -0.47 -0.35 -0.44  1.00 -0.89
[6,]  0.35  0.48  0.31  0.42 -0.89  1.00

n= 506 


P
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]       0    0    0    0    0  
[2,]  0         0    0    0    0  
[3,]  0    0         0    0    0  
[4,]  0    0    0         0    0  
[5,]  0    0    0    0         0  
[6,]  0    0    0    0    0       
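
The rows and columns above are labelled [,1] to [,6] because `cbind()` on bare `Tableau1$...` vectors produces a matrix without column names. A sketch with base R on built-in data (not the lab data) that keeps the names by subsetting the data frame first:

```r
vars <- c("mpg", "hp", "wt")          # built-in mtcars as a stand-in
m <- as.matrix(mtcars[, vars])

r_pearson  <- cor(m, method = "pearson")
r_spearman <- cor(m, method = "spearman")
round(r_spearman, 2)                  # rows and columns carry variable names

# p-value for a single pair of variables
cor.test(m[, "mpg"], m[, "wt"], method = "pearson")$p.value
```

Passing a matrix with column names to `rcorr()` should label its output the same way, since it takes the labels from the matrix.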

Simple linear regression


In [21]:
reg <- lm(TX_CHOM ~ FAIBREVPCT, data = Tableau1)
summary(reg)

names(Tableau1)


Out[21]:
Call:
lm(formula = TX_CHOM ~ FAIBREVPCT, data = Tableau1)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8828 -1.5430 -0.0331  1.2713 27.2094 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.197381   0.306152   7.177 2.56e-12 ***
FAIBREVPCT  0.242089   0.009222  26.252  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.958 on 504 degrees of freedom
Multiple R-squared:  0.5776,	Adjusted R-squared:  0.5768 
F-statistic: 689.2 on 1 and 504 DF,  p-value: < 2.2e-16
Out[21]:
  1. 'POPTOT_FR'
  2. 'FAIBLEREV'
  3. 'MONOPCT'
  4. 'MENAGE1PCT'
  5. 'IMMREC_PCT'
  6. 'TX_CHOM'
  7. 'NOECOLEPCT'
  8. 'SCO_M9PCT'
  9. 'SCO_M13PCT'
  10. 'PARTIELPCT'
  11. 'FAIBREVPCT'
  12. 'INDICE_PAU'
  13. 'Dist_Min'
  14. 'N_1000'
  15. 'Dist_Moy_3'
  16. 'Km2'
  17. 'HabKm2'
  18. 'SqrtDens'
  19. 'LogDens'
  20. 'SqrtImg'
  21. 'LogImg'
  22. 'Id'
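
In a simple linear regression, Multiple R-squared is the square of the Pearson correlation between the two variables. This is consistent with the results above: r = 0.76 between TX_CHOM and FAIBREVPCT, and 0.76² ≈ 0.5776. A minimal sketch of that check, assuming the built-in mtcars data as a stand-in for Tableau1:

```r
# Illustration on the built-in mtcars data (not the lab's Tableau1).
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)                        # intercept and slope

r2 <- summary(fit)$r.squared     # Multiple R-squared
r  <- cor(mtcars$mpg, mtcars$wt) # Pearson correlation

# With a single predictor, R-squared equals r squared
all.equal(r2, r^2)
```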

Contingency table


In [22]:
names(TableauKhi2)


Out[22]:
  1. 'SEX'
  2. 'DIST'
  3. 'Mode'

Categories of the nominal variables


In [23]:
# sex
table(TableauKhi2$SEX)
TableauKhi2$SEX <- factor(TableauKhi2$SEX, levels = c(1,2), labels = c("Homme", "Femme"))
table(TableauKhi2$SEX)

# transport mode
table(TableauKhi2$Mode)
TableauKhi2$Mode <- factor(TableauKhi2$Mode, levels = c(0:4), labels = c("Auto (conducteur)", "Auto (passager)", "Transport en commun", "Tranport actif", "Autres"))
table(TableauKhi2$Mode)

# distance
table(TableauKhi2$DIST)
TableauKhi2$DIST <- factor(TableauKhi2$DIST, levels = c(1:7), labels = c("Moins de 5 km", "5 à 9,9 km","10 à 14,9 km", "15 à 19,9 km", "20 à 24,9 km", "25 à 29,9 km", "30 km et plus"))
table(TableauKhi2$DIST)


Out[23]:
    1     2 
14276 13035 
Out[23]:
Homme Femme 
14276 13035 
Out[23]:
    0     1     2     3     4 
11868   851  3993  1271  9328 
Out[23]:
  Auto (conducteur)     Auto (passager) Transport en commun      Tranport actif 
              11868                 851                3993                1271 
             Autres 
               9328 
Out[23]:
   1    2    3    4    5    6    7 
5412 4007 2720 1649 1154  597  833 
Out[23]:
Moins de 5 km    5 à 9,9 km  10 à 14,9 km  15 à 19,9 km  20 à 24,9 km 
         5412          4007          2720          1649          1154 
 25 à 29,9 km 30 km et plus 
          597           833 
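
The DIST counts above sum to 16372 while SEX and Mode sum to 27311, which suggests that DIST contains missing values. table() drops NA by default, and factor() turns any code not listed in `levels` into NA. A small sketch of that behaviour, using hypothetical codes rather than the lab data:

```r
# Hypothetical codes: 9 is not a declared level, NA is already missing
codes <- c(1, 2, 2, 9, NA)
sex <- factor(codes, levels = c(1, 2), labels = c("Homme", "Femme"))

table(sex)                    # NA values are silently dropped
table(sex, useNA = "ifany")   # also shows the NA count

n_missing <- sum(is.na(sex))  # undeclared code + original NA
n_missing
```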

Contingency table


In [24]:
CrossTable(TableauKhi2$SEX, TableauKhi2$Mode, chisq=TRUE, expected=TRUE, resid=TRUE, format="SPSS")
CrossTable(TableauKhi2$SEX, TableauKhi2$DIST, chisq=TRUE, expected=TRUE, resid=TRUE, format="SPSS")
CrossTable(TableauKhi2$Mode, TableauKhi2$DIST, chisq=TRUE, expected=TRUE, resid=TRUE, format="SPSS")


   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|             Row Percent |
|          Column Percent |
|           Total Percent |
|                Residual |
|-------------------------|

Total Observations in Table:  27311 

                | TableauKhi2$Mode 
TableauKhi2$SEX |   Auto (conducteur)  |     Auto (passager)  | Transport en commun  |      Tranport actif  |              Autres  |           Row Total | 
----------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
          Homme |               5099  |                577  |               2321  |                661  |               5618  |              14276  | 
                |           6203.638  |            444.835  |           2087.220  |            664.377  |           4875.930  |                     | 
                |            196.695  |             39.268  |             26.185  |              0.017  |            112.936  |                     | 
                |             35.717% |              4.042% |             16.258% |              4.630% |             39.353% |             52.272% | 
                |             42.964% |             67.803% |             58.127% |             52.006% |             60.227% |                     | 
                |             18.670% |              2.113% |              8.498% |              2.420% |             20.570% |                     | 
                |          -1104.638  |            132.165  |            233.780  |             -3.377  |            742.070  |                     | 
----------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
          Femme |               6769  |                274  |               1672  |                610  |               3710  |              13035  | 
                |           5664.362  |            406.165  |           1905.780  |            606.623  |           4452.070  |                     | 
                |            215.422  |             43.006  |             28.678  |              0.019  |            123.688  |                     | 
                |             51.929% |              2.102% |             12.827% |              4.680% |             28.462% |             47.728% | 
                |             57.036% |             32.197% |             41.873% |             47.994% |             39.773% |                     | 
                |             24.785% |              1.003% |              6.122% |              2.234% |             13.584% |                     | 
                |           1104.638  |           -132.165  |           -233.780  |              3.377  |           -742.070  |                     | 
----------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
   Column Total |              11868  |                851  |               3993  |               1271  |               9328  |              27311  | 
                |             43.455% |              3.116% |             14.620% |              4.654% |             34.155% |                     | 
----------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  785.9131     d.f. =  4     p =  8.6413e-169 


 
       Minimum expected frequency: 406.1655 


   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|             Row Percent |
|          Column Percent |
|           Total Percent |
|                Residual |
|-------------------------|

Total Observations in Table:  16372 

                | TableauKhi2$DIST 
TableauKhi2$SEX | Moins de 5 km  |    5 à 9,9 km  |  10 à 14,9 km  |  15 à 19,9 km  |  20 à 24,9 km  |  25 à 29,9 km  | 30 km et plus  |     Row Total | 
----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
          Homme |         2976  |         2089  |         1340  |          765  |          498  |          241  |          321  |         8230  | 
                |     2720.545  |     2014.269  |     1367.310  |      828.932  |      580.101  |      300.104  |      418.739  |               | 
                |       23.987  |        2.773  |        0.545  |        4.931  |       11.620  |       11.640  |       22.813  |               | 
                |       36.160% |       25.383% |       16.282% |        9.295% |        6.051% |        2.928% |        3.900% |       50.269% | 
                |       54.989% |       52.134% |       49.265% |       46.392% |       43.154% |       40.369% |       38.535% |               | 
                |       18.177% |       12.760% |        8.185% |        4.673% |        3.042% |        1.472% |        1.961% |               | 
                |      255.455  |       74.731  |      -27.310  |      -63.932  |      -82.101  |      -59.104  |      -97.739  |               | 
----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
          Femme |         2436  |         1918  |         1380  |          884  |          656  |          356  |          512  |         8142  | 
                |     2691.455  |     1992.731  |     1352.690  |      820.068  |      573.899  |      296.896  |      414.261  |               | 
                |       24.246  |        2.803  |        0.551  |        4.984  |       11.745  |       11.766  |       23.060  |               | 
                |       29.919% |       23.557% |       16.949% |       10.857% |        8.057% |        4.372% |        6.288% |       49.731% | 
                |       45.011% |       47.866% |       50.735% |       53.608% |       56.846% |       59.631% |       61.465% |               | 
                |       14.879% |       11.715% |        8.429% |        5.399% |        4.007% |        2.174% |        3.127% |               | 
                |     -255.455  |      -74.731  |       27.310  |       63.932  |       82.101  |       59.104  |       97.739  |               | 
----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
   Column Total |         5412  |         4007  |         2720  |         1649  |         1154  |          597  |          833  |        16372  | 
                |       33.056% |       24.475% |       16.614% |       10.072% |        7.049% |        3.646% |        5.088% |               | 
----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  157.4649     d.f. =  6     p =  2.038182e-31 


 
       Minimum expected frequency: 296.8956 

Warning message:
In chisq.test(t, correct = FALSE, ...): Chi-squared approximation may be incorrect
   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|             Row Percent |
|          Column Percent |
|           Total Percent |
|                Residual |
|-------------------------|

Total Observations in Table:  16372 

                    | TableauKhi2$DIST 
   TableauKhi2$Mode | Moins de 5 km  |    5 à 9,9 km  |  10 à 14,9 km  |  15 à 19,9 km  |  20 à 24,9 km  |  25 à 29,9 km  | 30 km et plus  |     Row Total | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
  Auto (conducteur) |         2826  |         2467  |         1962  |         1262  |          927  |          456  |          664  |        10564  | 
                    |     3492.082  |     2585.509  |     1755.075  |     1064.014  |      744.616  |      385.213  |      537.492  |               | 
                    |      127.049  |        5.432  |       24.397  |       36.840  |       44.672  |       13.008  |       29.776  |               | 
                    |       26.751% |       23.353% |       18.573% |       11.946% |        8.775% |        4.317% |        6.285% |       64.525% | 
                    |       52.217% |       61.567% |       72.132% |       76.531% |       80.329% |       76.382% |       79.712% |               | 
                    |       17.261% |       15.068% |       11.984% |        7.708% |        5.662% |        2.785% |        4.056% |               | 
                    |     -666.082  |     -118.509  |      206.925  |      197.986  |      182.384  |       70.787  |      126.508  |               | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
    Auto (passager) |          248  |          203  |          124  |           82  |           40  |           35  |           34  |          766  | 
                    |      253.212  |      187.476  |      127.261  |       77.152  |       53.992  |       27.932  |       38.974  |               | 
                    |        0.107  |        1.285  |        0.084  |        0.305  |        3.626  |        1.789  |        0.635  |               | 
                    |       32.376% |       26.501% |       16.188% |       10.705% |        5.222% |        4.569% |        4.439% |        4.679% | 
                    |        4.582% |        5.066% |        4.559% |        4.973% |        3.466% |        5.863% |        4.082% |               | 
                    |        1.515% |        1.240% |        0.757% |        0.501% |        0.244% |        0.214% |        0.208% |               | 
                    |       -5.212  |       15.524  |       -3.261  |        4.848  |      -13.992  |        7.068  |       -4.974  |               | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
Transport en commun |         1234  |         1245  |          609  |          297  |          179  |          100  |           90  |         3754  | 
                    |     1240.939  |      918.781  |      623.679  |      378.106  |      264.605  |      136.888  |      191.002  |               | 
                    |        0.039  |      115.826  |        0.346  |       17.398  |       27.695  |        9.941  |       53.410  |               | 
                    |       32.872% |       33.165% |       16.223% |        7.912% |        4.768% |        2.664% |        2.397% |       22.929% | 
                    |       22.801% |       31.071% |       22.390% |       18.011% |       15.511% |       16.750% |       10.804% |               | 
                    |        7.537% |        7.604% |        3.720% |        1.814% |        1.093% |        0.611% |        0.550% |               | 
                    |       -6.939  |      326.219  |      -14.679  |      -81.106  |      -85.605  |      -36.888  |     -101.002  |               | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
     Tranport actif |         1063  |           87  |           21  |            6  |            5  |            5  |           35  |         1222  | 
                    |      403.950  |      299.081  |      203.020  |      123.081  |       86.134  |       44.560  |       62.175  |               | 
                    |     1075.251  |      150.389  |      163.192  |      111.373  |       76.424  |       35.121  |       11.877  |               | 
                    |       86.989% |        7.119% |        1.718% |        0.491% |        0.409% |        0.409% |        2.864% |        7.464% | 
                    |       19.642% |        2.171% |        0.772% |        0.364% |        0.433% |        0.838% |        4.202% |               | 
                    |        6.493% |        0.531% |        0.128% |        0.037% |        0.031% |        0.031% |        0.214% |               | 
                    |      659.050  |     -212.081  |     -182.020  |     -117.081  |      -81.134  |      -39.560  |      -27.175  |               | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
             Autres |           41  |            5  |            4  |            2  |            3  |            1  |           10  |           66  | 
                    |       21.817  |       16.153  |       10.965  |        6.648  |        4.652  |        2.407  |        3.358  |               | 
                    |       16.866  |        7.701  |        4.424  |        3.249  |        0.587  |        0.822  |       13.137  |               | 
                    |       62.121% |        7.576% |        6.061% |        3.030% |        4.545% |        1.515% |       15.152% |        0.403% | 
                    |        0.758% |        0.125% |        0.147% |        0.121% |        0.260% |        0.168% |        1.200% |               | 
                    |        0.250% |        0.031% |        0.024% |        0.012% |        0.018% |        0.006% |        0.061% |               | 
                    |       19.183  |      -11.153  |       -6.965  |       -4.648  |       -1.652  |       -1.407  |        6.642  |               | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
       Column Total |         5412  |         4007  |         2720  |         1649  |         1154  |          597  |          833  |        16372  | 
                    |       33.056% |       24.475% |       16.614% |       10.072% |        7.049% |        3.646% |        5.088% |               | 
--------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  2184.073     d.f. =  24     p =  0 


 
       Minimum expected frequency: 2.40667 
Cells with Expected Frequency < 5: 3 of 35 (8.571429%)
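
The "Expected Values" and "Chi-square contribution" rows can be recomputed by hand: the expected count of a cell is row total × column total / n, and its contribution is (observed − expected)² / expected. The sketch below redoes this for the SEX by Mode table, with the observed counts copied from the first CrossTable output above. The object names are chosen for the example only.

```r
# Observed counts copied from the SEX x Mode table above
obs <- matrix(c(5099, 577, 2321, 661, 5618,
                6769, 274, 1672, 610, 3710),
              nrow = 2, byrow = TRUE,
              dimnames = list(SEX  = c("Homme", "Femme"),
                              Mode = c("Auto (conducteur)", "Auto (passager)",
                                       "Transport en commun", "Transport actif",
                                       "Autres")))

n        <- sum(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n   # row total * col total / n
contrib  <- (obs - expected)^2 / expected           # chi-square contributions
chi2     <- sum(contrib)
df       <- (nrow(obs) - 1) * (ncol(obs) - 1)
p        <- pchisq(chi2, df, lower.tail = FALSE)

# Same test with the built-in function
test <- chisq.test(obs, correct = FALSE)
c(by_hand = chi2, chisq.test = unname(test$statistic))
```

The cells with the largest contributions (here the car drivers) are the ones that drive the association.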


Labo 4


In [2]:
# import
library(foreign)
library(nortest)
library(sas7bdat)
library(doBy)

# data
MTL <- read.sas7bdat("data/labo4/mtl_ttest.sas7bdat", debug=FALSE)
TOR <- read.sas7bdat("data/labo4/tor_ttest.sas7bdat", debug=FALSE)
VAN <- read.sas7bdat("data/labo4/van_ttest.sas7bdat", debug=FALSE)
TROISRMR <- read.sas7bdat("data/labo4/troisrmr_anova.sas7bdat", debug=FALSE)
names(MTL)
names(TOR)
names(VAN)
names(TROISRMR)


Loading required package: survival
Out[2]:
  1. 'CMA'
  2. 'SEX'
  3. 'TOTINC'
  4. 'WEIGHT'
  5. 'LogTotInc'
Out[2]:
  1. 'CMA'
  2. 'SEX'
  3. 'TOTINC'
  4. 'WEIGHT'
  5. 'LogTotInc'
Out[2]:
  1. 'CMA'
  2. 'SEX'
  3. 'TOTINC'
  4. 'WEIGHT'
  5. 'LogTotInc'
Out[2]:
  1. 'CMA'
  2. 'GROSRT'
  3. 'VALUE'
  4. 'HH_ID'

In [3]:
# categories (labels)
table(MTL$SEX)
table(TOR$SEX)
table(VAN$SEX)
MTL$SEX <- factor(MTL$SEX, levels = c(1,2), labels = c("Homme", "Femme"))
TOR$SEX <- factor(TOR$SEX, levels = c(1,2), labels = c("Homme", "Femme"))
VAN$SEX <- factor(VAN$SEX, levels = c(1,2), labels = c("Homme", "Femme"))
table(MTL$SEX)
table(TOR$SEX)
table(VAN$SEX)

TROISRMR$CMA <- factor(TROISRMR$CMA, levels = c(462,535,933), labels = c("Montréal", "Toronto", "Vancouver"))
table(TROISRMR$CMA)


Out[3]:
    1     2 
13685 12698 
Out[3]:
    1     2 
18429 17013 
Out[3]:
   1    2 
7859 7327 
Out[3]:
Homme Femme 
13685 12698 
Out[3]:
Homme Femme 
18429 17013 
Out[3]:
Homme Femme 
 7859  7327 
Out[3]:
 Montréal   Toronto Vancouver 
    15173     17935      8103 

T-test: comparison of means

F test

Checking the equality of variances


In [6]:
var.test(TOTINC ~ SEX, alternative='two.sided', conf.level=.95, data=MTL)


Out[6]:
	F test to compare two variances

data:  TOTINC by SEX
F = 0.1925, num df = 13684, denom df = 12697, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.1860374 0.1991902
sample estimates:
ratio of variances 
         0.1925036 

Interpretation

  • p-value < 2.2e-16
    • p < 0.05, so the variances differ: use the Satterthwaite method
  • true ratio of variances is not equal to 1
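
The F statistic reported by var.test() is simply the variance of the first group divided by the variance of the second, and its P value can drive the choice of `var.equal` in t.test(). A minimal sketch of that decision rule, assuming the built-in ToothGrowth data as a stand-in for MTL (the 0.05 threshold is the one used in these notes):

```r
# Illustration on the built-in ToothGrowth data (not the lab's MTL table).
v <- tapply(ToothGrowth$len, ToothGrowth$supp, var)
f_by_hand <- unname(v[1] / v[2])   # variance of first level / second level

ftest <- var.test(len ~ supp, data = ToothGrowth)
unname(ftest$statistic)            # same ratio as f_by_hand

# Decision rule: pooled t-test only if equality of variances is not rejected
equal_var <- ftest$p.value >= 0.05
t.test(len ~ supp, data = ToothGrowth, var.equal = equal_var)
```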

Satterthwaite method

Variances not equal: P < 0.05

  • var.equal=FALSE

In [7]:
t.test(TOTINC~SEX, alternative='two.sided', conf.level=.95, var.equal=FALSE, data=MTL)


Out[7]:
	Welch Two Sample t-test

data:  TOTINC by SEX
t = -27.088, df = 17131, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16456.17 -14235.32
sample estimates:
mean in group Homme mean in group Femme 
           29117.09            44462.84 

Interpretation

  • t = -27.088
  • p-value < 2.2e-16

Pooled method

Variances equal: P >= 0.05 (shown here for comparison, since the F test above rejected equality)

  • var.equal=TRUE

In [8]:
t.test(TOTINC~SEX, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=MTL)
boxplot(TOTINC~SEX, data = MTL, col = "coral", main="Boites à moustache (RMR de Montréal)", xlab="Sexe", ylab="Revenu total")
boxplot(LogTotInc~SEX, data = MTL, col = "coral", main="Boites à moustache (RMR de Montréal)", xlab="Sexe", ylab="Revenu total (log)")


Out[8]:
	Two Sample t-test

data:  TOTINC by SEX
t = -27.783, df = 26381, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16428.36 -14263.13
sample estimates:
mean in group Homme mean in group Femme 
           29117.09            44462.84 

Interpretation

  • t = -27.783
  • p-value < 2.2e-16

Analysis of the results

Give the context of the dataset, report the values, and compare the two means for the two categories of the qualitative variable, for example: "the difference between the means (x) is significant (t = -27.09; P < 0.001)".

ANOVA: analysis of variance

Mean by group


In [9]:
# doBy
summaryBy(GROSRT ~ CMA, TROISRMR, FUN=c(mean), na.rm=TRUE)


Out[9]:
        CMA GROSRT.mean
1  Montréal    622.3443
2   Toronto    858.7333
3 Vancouver    796.6063
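
The same group means can be obtained without the doBy package. The sketch below shows two base R equivalents, assuming the built-in PlantGrowth data as a stand-in for TROISRMR:

```r
# Illustration on the built-in PlantGrowth data (not the lab's TROISRMR).
# tapply() returns a named vector, one mean per group
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean, na.rm = TRUE)
means

# aggregate() returns a data frame; the formula interface drops NA rows
aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
```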

Boxplot

Visualising the ANOVA


In [11]:
# boxplots by group, a visual check before the analysis-of-variance F test
boxplot(GROSRT ~ CMA, data = TROISRMR, col = "lightyellow", main="Boites à moustache", xlab="Région métropolitaine", ylab="Loyer ($)")


ANOVA


In [13]:
anova.aov <- aov(GROSRT ~ CMA, data = TROISRMR)
summary(anova.aov)


Out[13]:
              Df    Sum Sq  Mean Sq F value Pr(>F)    
CMA            2 100629004 50314502     506 <2e-16 ***
Residuals   8379 833199213    99439                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
32829 observations deleted due to missingness

Interpretation

  • CMA Sum Sq = explained sum of squares (between groups)
  • Residuals Sum Sq = unexplained sum of squares (within groups)
  • CMA Df = degrees of freedom for the explained variance (between)
  • Residuals Df = degrees of freedom for the unexplained variance (within)
  • Mean Sq = Sum Sq / Df, i.e. the explained and unexplained variances
  • CMA F value = observed F
  • CMA Pr(>F) = P value associated with the F value
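
The observed F can be rebuilt from the other columns of the table: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. A short sketch using the values printed in the ANOVA table above (variable names are chosen for the example):

```r
# Values copied from the summary(anova.aov) output above
ss_between <- 100629004; df_between <- 2      # CMA row
ss_within  <- 833199213; df_within  <- 8379   # Residuals row

ms_between <- ss_between / df_between   # explained variance (Mean Sq, CMA)
ms_within  <- ss_within  / df_within    # unexplained variance (Mean Sq, Residuals)

f_obs <- ms_between / ms_within         # observed F
round(c(ms_between = ms_between, ms_within = ms_within, f_obs = f_obs))
```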

F test

Hypothesis H0 = "all group means are equal" (the between-group variance is no larger than the within-group variance)

  • k = number of groups
  • n = number of observations

  • Numerator DF (explained variance, between) in the Fisher table
    • k - 1
  • Denominator DF (unexplained variance, within) in the Fisher table
    • n - k

Computing the theoretical (critical) F with qf()

  • Theoretical F
  • P associated with the theoretical F, significance levels
    • 95%: p = 0.05
    • 99%: p = 0.01
    • 99.9%: p = 0.001

In [20]:
f_theorique <- qf(0.99, 2, 8379)
f_theorique
# qt() gives the Student t table, for the coefficient of ... (see other courses)


Out[20]:
4.60770215461875

Interpretation

  • Observed F > theoretical F
    • the group means are statistically different (at least one differs from the others)
    • H0 rejected
  • Observed F < theoretical F
    • the group means are not significantly different
    • H0 not rejected
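
The comparison described in this list can be done for several significance levels at once, and pf() gives the P value of the observed F directly, so the table lookup is optional. A minimal sketch using the observed F (506) and the degrees of freedom (2 and 8379) from the ANOVA output above:

```r
# Observed F and degrees of freedom taken from the ANOVA table above
f_obs <- 506
df1   <- 2      # k - 1
df2   <- 8379   # n - k

alpha  <- c(0.05, 0.01, 0.001)
f_crit <- qf(1 - alpha, df1, df2)   # theoretical F for each level

data.frame(alpha, f_crit, reject_H0 = f_obs > f_crit)

# P value attached to the observed F (upper tail of the F distribution)
pf(f_obs, df1, df2, lower.tail = FALSE)
```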

Computing R squared

To obtain the coefficient of determination


In [14]:
anova.r2 <- lm(GROSRT ~ CMA, data = TROISRMR)
summary(anova.r2)


Out[14]:
Call:
lm(formula = GROSRT ~ CMA, data = TROISRMR)

Residuals:
    Min      1Q  Median      3Q     Max 
-758.73 -158.73  -22.34  103.39 1161.39 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   622.344      4.760  130.73   <2e-16 ***
CMAToronto    236.389      7.751   30.50   <2e-16 ***
CMAVancouver  174.262      9.854   17.68   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 315.3 on 8379 degrees of freedom
  (32829 observations deleted due to missingness)
Multiple R-squared:  0.1078,	Adjusted R-squared:  0.1075 
F-statistic:   506 on 2 and 8379 DF,  p-value: < 2.2e-16

Interpretation

  • Multiple R-squared = coefficient of determination
    • the qualitative variable explains x% of the variation in the quantitative variable (here 10.78%)
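
The coefficient of determination can also be read off the ANOVA table without refitting with lm(): it is the explained sum of squares divided by the total sum of squares. A sketch using the sums of squares and degrees of freedom printed earlier (variable names are chosen for the example):

```r
# Values copied from the summary(anova.aov) output
ss_between <- 100629004   # CMA Sum Sq
ss_within  <- 833199213   # Residuals Sum Sq
k <- 3                    # number of groups
n <- 8379 + k             # residual Df = n - k, so n = 8382 observations used

r2     <- ss_between / (ss_between + ss_within)   # Multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k)        # Adjusted R-squared

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```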

Tukey test

Pairwise comparison of the group means


In [16]:
TukeyHSD(anova.aov)


Out[16]:
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = GROSRT ~ CMA, data = TROISRMR)

$CMA
                        diff       lwr       upr p adj
Toronto-Montréal   236.38891 218.22044 254.55738     0
Vancouver-Montréal 174.26194 151.16411 197.35976     0
Vancouver-Toronto  -62.12697 -86.91723 -37.33671     0
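
A "p adj" printed as 0 is a rounded value, and a pair is significant at the 95% level when its interval [lwr, upr] excludes 0. The sketch below shows how the pairs could be flagged programmatically, assuming the built-in PlantGrowth data as a stand-in for TROISRMR:

```r
# Illustration on the built-in PlantGrowth data (not the lab's TROISRMR).
fit <- aov(weight ~ group, data = PlantGrowth)
tuk <- TukeyHSD(fit)

pairs_mat <- tuk$group                   # matrix: diff, lwr, upr, p adj
significant <- pairs_mat[, "p adj"] < 0.05

# Equivalent criterion: the confidence interval does not contain 0
excludes_zero <- pairs_mat[, "lwr"] > 0 | pairs_mat[, "upr"] < 0

cbind(round(pairs_mat, 3), significant, excludes_zero)
```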