In [2]:
# read titanic data
train <- read.csv("./data/kaggle_titanic.csv")
str(train)
In [4]:
# divide train / test data
set.seed(57)
index.train <- sample(1:nrow(train), 800)
titanic.train <- train[index.train, ]
titanic.test <- train[-index.train, ]
tail(titanic.train)
In [5]:
model.titanic <- glm(Survived ~., family = binomial(link = 'logit'), data = titanic.train)
In [7]:
# summary(model.titanic)
In [8]:
model.titanic <- glm(Survived~Pclass+Age+SibSp+Parch+Fare,
family=binomial(link='logit'),data=titanic.train)
summary(model.titanic)
In [9]:
model.titanic <- glm(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked,
family=binomial(link='logit'),data=titanic.train)
summary(model.titanic)
Embarked takes three distinct values; glm() automatically creates new indicator columns for them when fitting the model.
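To see which columns are generated, the design matrix can be inspected directly. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the Titanic data) and assumes R's default treatment contrasts, under which a factor with three levels yields two indicator columns and the first level acts as the baseline.

```r
# Toy data standing in for the real Embarked column (hypothetical values)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Embarked = factor(c("C", "Q", "S", "S", "C", "Q"))
)

# model.matrix() builds the same kind of design matrix that glm() fits on
design <- model.matrix(Survived ~ Embarked, data = toy)

# Column names show the indicator columns created for the factor
print(colnames(design))
print(design)
```

The baseline level ("C" here) has no column of its own; its effect is absorbed into the intercept, and the other coefficients are read relative to it.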
In [11]:
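# impute missing Age values with the mean of the observed ages
# (the column is `Age`; assigning to `age` would create a new, unused column)
titanic.train$Age[is.na(titanic.train$Age)] <- mean(titanic.train$Age, na.rm = TRUE)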
model.titanic_2 <- glm(Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked,
family=binomial(link='logit'),data=titanic.train)
summary(model.titanic_2)
Missing Age values were replaced with the mean, but the effect was not particularly significant.
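The imputation step can be checked in isolation. This is a minimal sketch on a made-up vector (`age` and its values are hypothetical). Note that glm() drops rows containing missing values under R's default `na.action`, so imputing mainly changes how many rows take part in the fit.

```r
# Toy ages with missing entries (hypothetical values)
age <- c(22, NA, 38, 26, NA, 35)

# Mean of the observed values only
age.mean <- mean(age, na.rm = TRUE)

# Replace the missing entries with that mean
age.imputed <- age
age.imputed[is.na(age.imputed)] <- age.mean

# Count missing values before and after imputation
print(sum(is.na(age)))
print(sum(is.na(age.imputed)))
```

Running the same `sum(is.na(...))` check on `titanic.train$Age` before and after imputation confirms that the replacement actually took effect.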
In [13]:
# keep only the predictors used in the model
titanic.test.fit <- subset(titanic.test,
                           select = c(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked))
tail(titanic.test.fit)