2. Introduction
• RMS Titanic was a British passenger liner that started its journey with
about 2200 passengers and crew and four days later sank in the North Atlantic
Ocean in the early morning of 15th April 1912. Around 1500 people died and 700
survived the tragedy
• According to Encyclopedia Titanica, of the 712 survivors, 500 were
passengers (369 women and children, 131 men) and 212 were crew (20
women, 192 men)
3. Problem Statement
• Hypothesis: Certain sources claim that the survivors belonged to one of
the following categories: Women, Children and/or Upper Class
• Our problem is to confirm whether this hypothesis is true, using the given
sample data on 342 survivors, and to derive conclusions using different
models in R
6. Data Acquisition & Processing
• Data source:
Kaggle - https://www.kaggle.com/c/titanic
• Data Processing
• Analyzed the data types using the str() function and converted some
factor data types
• Identified the columns with NA/NaN values
• Filled the empty values with the median or mode of the
columns
• Populated the missing ages with the median age
• Populated the empty Embarked values with Cherbourg
• Populated the empty Sex values with Male, as the distribution was mostly male
• Converted the values into factors before generating the Association
model
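The imputation steps above can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame (the values are hypothetical, not the Kaggle data). The column names Age, Embarked and Sex follow the Kaggle file, and stat_mode is a helper name introduced here for illustration.

```r
# Toy data frame with hypothetical values; column names follow the Kaggle file.
titanic <- data.frame(
  Age      = c(22, NA, 38, 26, NA, 35),
  Embarked = c("S", "C", "", "S", "Q", "S"),
  Sex      = c("male", "female", "", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): most frequent non-empty value of a vector.
stat_mode <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(which.max(table(x)))
}

# Count missing (NA) or empty ("") entries per column.
missing_counts <- sapply(titanic, function(col) sum(is.na(col) | col == ""))

# Numeric column: fill NA with the median of the observed values.
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)

# Categorical columns: the slides use Cherbourg ("C") for Embarked
# and the most frequent value for Sex.
titanic$Embarked[titanic$Embarked == ""] <- "C"
titanic$Sex[titanic$Sex == ""] <- stat_mode(titanic$Sex)

# Convert the categorical columns to factors for modelling.
titanic$Embarked <- factor(titanic$Embarked)
titanic$Sex      <- factor(titanic$Sex)

str(titanic)
```

Filling with a single median or mode is simple but shrinks the variance of the imputed column, which is worth keeping in mind when reading the Age effects later.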
7. Data Analysis - Statistical Models
• Generalized Linear Model (GLM)
• Model 1 - Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + AgeGrp
• Model 2 - Pclass + Sex + Age
• Model 3 - poly(Age, 2) * Sex * Pclass + SibSp
• LDA Model
• QDA Model
• KNN Model
• Decision Tree (recursive partitioning)
• Random Forest
• Association Rules
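As a rough sketch of how the GLM formulas listed above could be passed to glm() in R: the data below is synthetic (hypothetical values generated only so the code runs), Pclass is treated as numeric (an assumption; the slides do not say), and only Models 2 and 3 are shown. family = binomial makes this a logistic regression.

```r
set.seed(7)
n <- 400

# Synthetic passengers (hypothetical values); column names follow the Kaggle file.
passengers <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  SibSp  = rpois(n, 0.5)
)
logit <- 2.5 - 2.5 * (passengers$Sex == "male") - 0.8 * passengers$Pclass -
  0.02 * passengers$Age - 0.3 * passengers$SibSp
passengers$Survived <- rbinom(n, 1, plogis(logit))

# Formulas as written on the slide.
formulas <- list(
  model2 = Survived ~ Pclass + Sex + Age,
  model3 = Survived ~ poly(Age, 2) * Sex * Pclass + SibSp
)

# Fit one logistic regression per formula.
fits <- lapply(formulas, function(f) {
  glm(f, family = binomial, data = passengers)
})

# In-sample accuracy at a 0.5 probability cut-off.
acc <- sapply(fits, function(fit) {
  mean((fitted(fit) > 0.5) == passengers$Survived)
})
acc
```

The `*` operator expands to all main effects and interactions, so Model 3 estimates many more coefficients than Model 2 from the same data.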
8. Model Result Matrix
Model            Accuracy   95% Confidence Interval
GLM - Model 1    79.03%     (0.7365, 0.8375)
GLM - Model 2    78.28%     (0.7284, 0.8307)
GLM - Model 3    80.90%     (0.7566, 0.8543)
LDA              79.40%     (0.7405, 0.8409)
QDA              77.53%     (0.7204, 0.8239)
KNN              62.92%     (0.5682, 0.6873)
Decision Tree    78.65%     (0.7324, 0.8341)
Random Forest    81.65%     (0.7647, 0.861)
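For reference, an accuracy and its 95% interval can be derived from the number of correct predictions. The sketch below uses the counts quoted on the GLM Model 3 slide (186 non-survivors with 37 misclassified, 81 survivors with 14 misclassified) and an exact binomial interval via binom.test(). Whether the original analysis used this interval method is an assumption.

```r
# Counts taken from the GLM Model 3 slide.
n_total   <- 186 + 81            # validation-set size
n_correct <- n_total - (37 + 14) # correct predictions

accuracy <- n_correct / n_total

# Exact (Clopper-Pearson) binomial interval; the interval method used in
# the original analysis is not stated, so this choice is an assumption.
ci <- binom.test(n_correct, n_total, conf.level = 0.95)$conf.int

round(100 * accuracy, 2)
round(as.numeric(ci), 4)
```

With only 267 validation cases the intervals are about ten percentage points wide, so most models in the table overlap with one another.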
9. Result Interpretation - GLM - Model 1
• Model 1 indicates that Pclass and Sexmale have the highest significance, followed by
Age and SibSp
• Moving from 1st to 2nd class, or from 2nd to 3rd class, reduces the
odds of survival by a factor of 1.3 in each case
• Being male reduces the odds of survival by a factor of 2.7 compared to being female
• Age and SibSp reduce the odds of survival by factors of 0.09 and 0.33, respectively
• In Model 1, the predictors Parch, Fare, EmbarkedQ, EmbarkedS and AgeGrp are not
statistically significant
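In a logistic GLM the coefficients are on the log-odds scale, and exp(coefficient) gives the multiplicative change in the odds per unit increase of the predictor. If the factors quoted above are raw glm coefficients (an assumption; the slides do not say), the conversion to odds ratios would look like this. The negative signs are also assumed, because each effect is described as reducing the odds.

```r
# Hypothetical log-odds coefficients: magnitudes from the slide, negative
# signs assumed because each effect is described as reducing the odds.
log_odds <- c(Pclass = -1.3, Sexmale = -2.7, Age = -0.09, SibSp = -0.33)

# exp() turns a log-odds coefficient into an odds ratio.
odds_ratio <- exp(log_odds)

# Percentage change in the odds per unit increase of the predictor.
pct_change <- 100 * (odds_ratio - 1)

round(odds_ratio, 3)
round(pct_change, 1)
```

Under that assumption, a coefficient of -2.7 for Sexmale corresponds to an odds ratio of about 0.067, i.e. odds of survival roughly 15 times lower for males than for females.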
10. Result Interpretation - GLM - Model 2
• In Model 2, we kept only the statistically significant predictors identified in
Model 1
• Using only the significant predictors decreased the accuracy from 79.03% to
78.28%. This suggests that we dropped a few predictors that also contribute to
predictive power; in this case, these may be AgeGrp and EmbarkedS
11. Result Interpretation - GLM - Model 3
• Model 3 gives Age more weight by entering it as a second-degree polynomial,
poly(Age, 2), interacting with Sex and Pclass
• The strongest differentiating predictors are Sex, Pclass and SibSp
• The accuracy is the highest of the three GLM models, at 80.90%
• 37 of the 186 non-survivors were wrongly predicted to survive, a Type I error rate
of 19.89%; the overall error rate is 51 out of 267, i.e. 19.10%
• Only 14 of the 81 survivors were missed by this model, i.e. 17.28% (Type II)
• This is the most accurate of the GLM models for this data and shows that, along with
Age, other factors like Sex, Pclass and SibSp are statistically significant
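The error rates can be recomputed directly from the counts quoted above. A short sketch in R, using only those counts (the variable names are introduced here for illustration):

```r
# Counts from the slide, grouped by actual class.
non_survivors <- 186; false_pos <- 37  # did not survive, predicted to survive
survivors     <- 81;  false_neg <- 14  # survived, predicted not to survive

type1_rate <- false_pos / non_survivors                               # Type I
type2_rate <- false_neg / survivors                                   # Type II
error_rate <- (false_pos + false_neg) / (non_survivors + survivors)   # overall

round(100 * c(type1 = type1_rate, type2 = type2_rate, overall = error_rate), 2)
```

Keeping the denominators explicit avoids mixing the per-class rates (37/186, 14/81) with the overall error rate (51/267), which is the complement of the reported accuracy.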
12. Result Interpretation - GLM using ROC Curve
• ROC curve legend: Model 1 - Black, Model 2 - Green, Model 3 - Blue
• Area under the ROC curve: Model 1 - 0.8321, Model 2 - 0.8291, Model 3 - 0.8309
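For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. A minimal base-R sketch of that rank-based calculation follows; the auc helper and the scores are hypothetical, and the original analysis may well have used a dedicated ROC package instead.

```r
# Rank-based AUC: probability that a randomly chosen survivor gets a higher
# score than a randomly chosen non-survivor (ties count one half).
auc <- function(score, label) {
  r     <- rank(score)          # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true outcomes (1 = survived).
score <- c(0.91, 0.80, 0.65, 0.40, 0.35, 0.10)
label <- c(1,    1,    0,    1,    0,    0)

auc(score, label)
```

In this toy example 8 of the 9 survivor/non-survivor pairs are ordered correctly, so the AUC is 8/9.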
13. Result Interpretation - LDA
• LDA was fitted on 70% of the sample
(training set) and validated on the remaining 30%
(validation set) to predict survival
• The LDA model has an accuracy of 79.4%
• 104 passengers were predicted to survive, of whom
73 actually survived. Hence 31 of the 170
non-survivors were incorrectly predicted to survive,
and 24 survivors were missed by the model
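The 70/30 train/validation procedure described above can be sketched with lda() from the MASS package. The data here is synthetic (hypothetical values generated so the example runs), and the predictor set Pclass + Sex + Age is an assumption, since the slides do not list the variables used for LDA.

```r
library(MASS)  # lda(); MASS ships with standard R distributions

set.seed(1)
n <- 300

# Synthetic passengers (hypothetical values); column names follow the Kaggle file.
passengers <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
logit <- 2.5 - 2.5 * (passengers$Sex == "male") - 0.8 * passengers$Pclass -
  0.02 * passengers$Age
passengers$Survived <- factor(rbinom(n, 1, plogis(logit)), levels = c(0, 1))

# 70% training set, 30% validation set.
idx   <- sample(n, size = round(0.7 * n))
train <- passengers[idx, ]
valid <- passengers[-idx, ]

fit  <- lda(Survived ~ Pclass + Sex + Age, data = train)
pred <- predict(fit, newdata = valid)$class

# Confusion matrix and accuracy on the validation set.
cm       <- table(Predicted = pred, Actual = valid$Survived)
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```

Replacing lda() with qda() from the same package gives the quadratic variant with an otherwise identical workflow.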
14. Result Interpretation - QDA
• QDA was fitted on the same sample set as LDA
to predict survival
• The QDA model has an accuracy of 77.53%
• 104 passengers were predicted to survive, of whom
71 actually survived. Hence 33 of the 169
non-survivors were incorrectly predicted to survive,
and 27 survivors were missed by this model
15. Result Interpretation - KNN
• We used K=1 for the KNN model. KNN does not
indicate which factors are significant in determining
survival
• The accuracy of KNN is the lowest (62.92%) of the
classification methods. This may indicate that the
data has a complex non-linear relationship that this
non-parametric method does not capture; with K=1 the
model is also sensitive to noise in the training set
• The confusion matrix shows that 63 of the 190
non-survivors, i.e. 33.16%, were predicted as
survivors (Type I error)
• Of the 77 survivors, 36 were missed by the
model, i.e. 46.75% (Type II error), which implies that
this is not a good model to use
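Since KNN is distance based, its accuracy depends on the choice of K and on the scale of the predictors. The sketch below compares a few K values on standardized synthetic data using knn() from the class package. The data and the predictor set are hypothetical, and whether the original analysis scaled its predictors is not stated.

```r
library(class)  # knn(); ships with standard R distributions

set.seed(42)
n <- 300

# Synthetic stand-in for the Titanic predictors (hypothetical values).
x <- data.frame(
  Sexmale = rbinom(n, 1, 0.6),
  Pclass  = sample(1:3, n, replace = TRUE),
  Age     = round(runif(n, 1, 70)),
  Fare    = round(rexp(n, 1 / 30), 2)
)
y <- factor(rbinom(n, 1, plogis(2.5 - 2.5 * x$Sexmale - 0.8 * x$Pclass)),
            levels = c(0, 1))

# 70/30 split; standardize using the training-set mean and sd only.
idx     <- sample(n, size = round(0.7 * n))
train_x <- scale(x[idx, ])
valid_x <- scale(x[-idx, ],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Validation accuracy for several values of K.
ks  <- c(1, 5, 15)
acc <- sapply(ks, function(k) {
  pred <- knn(train_x, valid_x, cl = y[idx], k = k)
  mean(pred == y[-idx])
})
names(acc) <- paste0("k=", ks)
acc
```

Without scaling, a wide-ranging column such as Fare dominates the Euclidean distance, which is one possible reason for a weak K=1 result.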
16. Result Interpretation - Decision Tree
● The decision tree highlights that, apart from
Sex and Pclass, Age, Fare, SibSp and
Embarked are also significant predictors
● The tree also shows that females in
3rd class have a 21% survival rate
● The tree also shows that males older
than 6.5 years have an 80% non-survival rate
17. Result Interpretation - Random Forest
● The Mean Decrease Accuracy
measure shows that Sex
is the most important predictor
of passenger survival,
followed by Pclass, Fare and Age
18. Result Interpretation - Association Analysis
● Based on the Association Analysis,
we found that 3rd class male
passengers who embarked from
Southampton and are in the age
group of 20-30 years are 46.6% more
likely not to survive compared to
others.
19. Conclusion
• Comparing the accuracy of the different models, Random
Forest is the best (81.65%), followed by GLM Model 3 (80.90%),
in which Age is entered as a second-degree
polynomial
• The Random Forest model highlights the importance of the predictors
Sex, Pclass, Fare and Age
• GLM Model 3 gives similar results, with Age given higher
weight, followed by Sex, Pclass and SibSp
• After analyzing all the models, we can conclude that the predictors
Sex, Pclass and Age played a major role in survival on the Titanic