01. 로지스틱 회귀분석

실습 (1) Titanic

kaggle Titanic Competition => kaggle_titanic.csv import


In [2]:
# read titanic data
train <- read.csv("./data/kaggle_titanic.csv")

str(train)


'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 416 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 525 596 662 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

In [4]:
# divide train / test data
set.seed(57)
index.train <- sample(1:nrow(train), 800)
titanic.train <- train[index.train, ]
titanic.test <- train[-index.train, ]
tail(titanic.train)


PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
246246 0 1 Minahan, Dr. William Edwardmale 44.0 2 0 19928 90.0000 C78 Q
846846 0 3 Abbing, Mr. Anthony male 42.0 0 0 C.A. 5547 7.5500 S
357357 1 1 Bowerman, Miss. Elsie Edithfemale 22.0 0 1 113505 55.0000 E33 S
666666 0 2 Hickman, Mr. Lewis male 32.0 2 0 S.O.C. 14879 73.5000 S
885885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 S
844844 0 3 Lemberopolous, Mr. Peter L male 34.5 0 0 2683 6.4375 C

step 1: all variables with glm


In [5]:
model.titanic <- glm(Survived ~., family = binomial(link = 'logit'), data = titanic.train)


Warning message:
“glm.fit: algorithm did not converge”

In [7]:
# summary(model.titanic)

step 2: numeric variables with glm


In [8]:
model.titanic <- glm(Survived~Pclass+Age+SibSp+Parch+Fare,
                     family=binomial(link='logit'),data=titanic.train)
summary(model.titanic)


Call:
glm(formula = Survived ~ Pclass + Age + SibSp + Parch + Fare, 
    family = binomial(link = "logit"), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4088  -0.8504  -0.6104   0.9813   2.4071  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.682903   0.530049   6.948 3.70e-12 ***
Pclass      -1.212184   0.153942  -7.874 3.43e-15 ***
Age         -0.046144   0.007504  -6.149 7.79e-10 ***
SibSp       -0.327042   0.113071  -2.892  0.00382 ** 
Parch        0.235117   0.115339   2.038  0.04150 *  
Fare         0.002131   0.002569   0.829  0.40688    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 871.75  on 641  degrees of freedom
Residual deviance: 733.56  on 636  degrees of freedom
  (158 observations deleted due to missingness)
AIC: 745.56

Number of Fisher Scoring iterations: 4
  • Deviance Residuals: 잔차들의 분포
  • Coefficients
    • Estimate: 추정된 계수
    • Std. Error: 표준오차
    • z value: 표준오차와 표준편차로부터
    • Pr(>|z|): p-value
  • Residual deviance: 얼마나 모형에서 벗어났는지
  • Null deviance: (Intercept)만 고려할 때 얼마나 모형에서 벗어났는지

step 3: numeric var and categorical var


In [9]:
model.titanic <- glm(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked,
                     family=binomial(link='logit'),data=titanic.train)
summary(model.titanic)


Call:
glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
    Fare + Embarked, family = binomial(link = "logit"), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6971  -0.6445  -0.3550   0.6335   2.4497  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.825e+01  6.067e+02   0.030  0.97601    
Pclass      -1.284e+00  1.755e-01  -7.315 2.57e-13 ***
Sexmale     -2.664e+00  2.378e-01 -11.205  < 2e-16 ***
Age         -4.500e-02  8.561e-03  -5.257 1.47e-07 ***
SibSp       -3.591e-01  1.389e-01  -2.585  0.00973 ** 
Parch       -6.654e-02  1.327e-01  -0.501  0.61604    
Fare        -8.036e-04  2.794e-03  -0.288  0.77363    
EmbarkedC   -1.206e+01  6.067e+02  -0.020  0.98414    
EmbarkedQ   -1.337e+01  6.067e+02  -0.022  0.98242    
EmbarkedS   -1.265e+01  6.067e+02  -0.021  0.98337    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 871.75  on 641  degrees of freedom
Residual deviance: 567.07  on 632  degrees of freedom
  (158 observations deleted due to missingness)
AIC: 587.07

Number of Fisher Scoring iterations: 13

Embarked는 총 세가지 변수로 구성되어있는데, 이를 자동으로 새로운 column을 생성하여 fitting 된다.

step 4: Age가 NA인 경우 평균값 사용


In [11]:
titanic.train$age[is.na(titanic.train$Age)] <- mean(titanic.train$Age, na.rm=TRUE)
model.titanic_2 <- glm(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked,
                     family=binomial(link='logit'),data=titanic.train)
summary(model.titanic_2)


Call:
glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
    Fare + Embarked, family = binomial(link = "logit"), data = titanic.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6971  -0.6445  -0.3550   0.6335   2.4497  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.825e+01  6.067e+02   0.030  0.97601    
Pclass      -1.284e+00  1.755e-01  -7.315 2.57e-13 ***
Sexmale     -2.664e+00  2.378e-01 -11.205  < 2e-16 ***
Age         -4.500e-02  8.561e-03  -5.257 1.47e-07 ***
SibSp       -3.591e-01  1.389e-01  -2.585  0.00973 ** 
Parch       -6.654e-02  1.327e-01  -0.501  0.61604    
Fare        -8.036e-04  2.794e-03  -0.288  0.77363    
EmbarkedC   -1.206e+01  6.067e+02  -0.020  0.98414    
EmbarkedQ   -1.337e+01  6.067e+02  -0.022  0.98242    
EmbarkedS   -1.265e+01  6.067e+02  -0.021  0.98337    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 871.75  on 641  degrees of freedom
Residual deviance: 567.07  on 632  degrees of freedom
  (158 observations deleted due to missingness)
AIC: 587.07

Number of Fisher Scoring iterations: 13

평균값으로 대체하였으나 크게 유의하지 않음.

step 5: 예측에 사용한 변수 선택


In [13]:
titanic.test.fit <- subset(titanic.test,select=c(3,5,6,7,8,10,12))
tail(titanic.test.fit)


PclassSexAgeSibSpParchFareEmbarked
8293 male NA 0 0 7.7500Q
8533 female 9 1 1 15.2458C
8721 female 47 1 1 52.5542S
8752 female 28 1 0 24.0000C
8763 female 15 0 0 7.2250C
8801 female 56 0 1 83.1583C

In [ ]:


In [ ]: