Business Analytics and Insights
Final Project
Pallavi Herekar | Sonali Haldar
Introduction
• RMS Titanic was a British passenger liner that started its journey with
about 2,200 people on board (passengers and crew) and four days later sank
in the North Atlantic Ocean in the early morning of 15 April 1912. Around
1,500 people died and about 700 survived the tragedy
• According to Encyclopedia Titanica, of the 712 survivors, 500 were
passengers (369 women and children, 131 men) and 212 were crew (20
women, 192 men)
Problem Statement
• Hypothesis: Certain sources claim that the survivors belonged to one or
more of the following categories: Women, Children and/or Upper Class
• Our task is to test whether this hypothesis holds, using the given
sample, which contains data on 342 survivors, and to derive conclusions
from different models in R
Data Visualization
(Two slides of charts; the images are not reproduced in this text)
Data Acquisition & Processing
• Data source:
Kaggle - https://www.kaggle.com/c/titanic
• Data Processing
• Inspected the data types with the str() function and converted some
of the factor data types
• Identified the columns with NA/NaN values
• Filled the empty values with the median or mode of each column:
  • Missing ages were filled with the median age
  • Empty Embarked values were filled with Cherbourg
  • Empty Sex values were filled with Male, as the distribution was mostly male
• Converted the values into factors before generating the Association
model
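The imputation steps above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the toy rows, the `impute` helper and the choice of `None` as the missing-value marker are all assumptions made for the example.

```python
from statistics import median, mode

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"Age": 22.0, "Embarked": "C", "Sex": "male"},
    {"Age": None, "Embarked": "C", "Sex": "female"},
    {"Age": 38.0, "Embarked": None, "Sex": "male"},
    {"Age": 26.0, "Embarked": "S", "Sex": None},
]

def impute(rows, column, strategy):
    """Fill missing values in one column with the column median or mode."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

cleaned = impute(rows, "Age", "median")        # numeric column: median
cleaned = impute(cleaned, "Embarked", "mode")  # categorical column: mode
cleaned = impute(cleaned, "Sex", "mode")
print(cleaned)
```

The median is used for the numeric column because it is less affected by extreme ages than the mean, and the mode is the natural analogue for categorical columns.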
Data Analysis - Statistical Models
• Generalized Linear Model (GLM)
  • Model 1 - Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + AgeGrp
  • Model 2 - Pclass + Sex + Age
  • Model 3 - poly(Age, 2) * Sex * Pclass + SibSp
• LDA Model
• QDA Model
• KNN Model
• Recursive Decision Tree
• Random Forest
• Association Rules
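In R formula notation, `a * b * c` stands for all main effects plus every interaction among them, so Model 3 contains more terms than its short formula suggests. The Python sketch below only expands the term names to make that visible; `expand_star` is a hypothetical helper and the term order may differ from what R prints.

```python
from itertools import combinations

def expand_star(factors):
    """Expand an R-style a * b * c into main effects and all interactions."""
    terms = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            terms.append(":".join(combo))
    return terms

# Model 3: poly(Age, 2) * Sex * Pclass + SibSp
model3_terms = expand_star(["poly(Age, 2)", "Sex", "Pclass"]) + ["SibSp"]
for term in model3_terms:
    print(term)
```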
Model Result Matrix
Model            Accuracy   95% Confidence Interval
GLM - Model 1    79.03%     (0.7365, 0.8375)
GLM - Model 2    78.28%     (0.7284, 0.8307)
GLM - Model 3    80.90%     (0.7566, 0.8543)
LDA              79.40%     (0.7405, 0.8409)
QDA              77.53%     (0.7204, 0.8239)
KNN              62.92%     (0.5682, 0.6873)
Decision Tree    78.65%     (0.7324, 0.8341)
Random Forest    81.65%     (0.7647, 0.8610)
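The accuracies above are all consistent with a validation set of 267 rows (for example, 211/267 = 79.03%). The sketch below shows one way such an interval can be computed, assuming an exact binomial (Clopper-Pearson) interval; both the interval type and the counts 211 and 267 are inferences from the reported figures, not something the slides state.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_decreasing(f, target, steps=60):
    """Bisection for f(p) = target, where f is decreasing in p on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

# Assumed counts: 211 correct predictions out of 267 validation rows.
low, high = clopper_pearson(211, 267)
print(round(low, 4), round(high, 4))
```

With intervals this wide, most of the models in the table overlap, so small differences in accuracy should be read with caution.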
Result Interpretation - GLM - Model 1
• Model 1 indicates that Pclass and Sexmale are the most significant predictors, followed by Age and SibSp
• Moving from 1st to 2nd class, or from 2nd to 3rd class, reduces the odds of survival
by a factor of 1.3 in each case
• Being male reduces the odds of survival by a factor of 2.7 compared with being female
• Age and SibSp reduce the odds of survival by factors of 0.09 and 0.33 respectively
• In Model 1, the predictors Parch, Fare, EmbarkedQ, EmbarkedS and AgeGrp are not statistically
significant
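A logistic regression reports coefficients on the log-odds scale, and the multiplicative change in odds is obtained by exponentiating them. The sketch below assumes, without confirmation from the slides, that the quoted magnitudes are negative log-odds coefficients; under that assumption it converts them to odds ratios.

```python
import math

def odds_ratio(coefficient):
    """Convert a log-odds coefficient to a multiplicative change in odds."""
    return math.exp(coefficient)

# Magnitudes quoted on the slide, treated here (an assumption) as negative
# log-odds coefficients of the logistic model.
reported = {"Pclass": -1.3, "Sexmale": -2.7, "Age": -0.09, "SibSp": -0.33}

ratios = {name: odds_ratio(beta) for name, beta in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f}")
```

An odds ratio below 1 means the odds of survival fall as the predictor increases; the further below 1, the stronger the effect.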
Result Interpretation - GLM - Model 2
• In Model 2, we kept only the statistically significant predictors identified in the analysis
of Model 1
• Keeping only the significant predictors decreased the accuracy from 79.03% to 78.28%.
This suggests that we dropped a few predictors that also influence the prediction,
possibly AgeGrp and EmbarkedS
Result Interpretation - GLM - Model 3
• Model 3 gives Age much higher weight by entering it as a second-degree polynomial
• The strongest differentiating predictors are Sex, Pclass and SibSp
• Accuracy is the highest of the three GLMs, at 80.9%
• 37 of 186 non-survivors were wrongly predicted to survive, an error rate of 19.89% (Type I)
• Only 14 of 81 survivors were missed by this model, i.e. 17.28% (Type II)
• On accuracy, this is the best of the three GLMs for this data, and it shows that, along with Age,
other factors such as Sex, Pclass and SibSp are statistically significant
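The error rates on this slide follow directly from the four cells of the confusion matrix. The sketch below rebuilds those cells from the counts quoted for Model 3 (81 survivors of whom 14 were missed, 186 non-survivors of whom 37 were predicted to survive) so a reader can check the percentages; the function name and dictionary keys are illustrative.

```python
def confusion_rates(tp, fn, fp, tn):
    """Accuracy and error rates, with 'survived' as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),  # Type I: non-survivor predicted to survive
        "false_negative_rate": fn / (fn + tp),  # Type II: survivor missed
    }

# Counts taken from the Model 3 slide.
rates = confusion_rates(tp=81 - 14, fn=14, fp=37, tn=186 - 37)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```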
Result Interpretation - GLM using ROC Curve
ROC curve legend: Model 1 - Black, Model 2 - Green, Model 3 - Blue
Area Under ROC Curve:
Model 1 - 0.8321
Model 2 - 0.8291
Model 3 - 0.8309
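The area under the ROC curve equals the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The sketch below computes it that way on a handful of made-up labels and scores; the data are hypothetical and unrelated to the values reported above.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted survival probabilities for four passengers.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc(toy_labels, toy_scores)
print(toy_auc)
```

Because AUC depends only on the ranking of scores, two models can differ in accuracy at a fixed threshold while having nearly identical AUC, as the three GLMs do here.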
Result Interpretation - LDA
• LDA was fitted on 70% of the sample
(Train Set) and validated on the remaining 30%
(Validation Set) to predict survival
• The LDA model has an accuracy of 79.4%
• 104 passengers were predicted to survive, of
whom 73 actually survived. Hence 31 of the 170
non-survivors were incorrectly identified as
survivors, and 24 survivors were missed by the model
Result Interpretation - QDA
• QDA was fitted on the same sample split as LDA
to predict survival
• The QDA model has an accuracy of 77.53%
• 104 passengers were predicted to survive, of
whom 71 actually survived. Hence 33 of the 169
non-survivors were incorrectly identified as
survivors, and 27 survivors were missed by this model
Result Interpretation - KNN
• We used K=1 for the KNN model. KNN does not
indicate which factors are significant in determining
survival
• KNN has the lowest accuracy (62.92%) of the
classification methods. This may indicate a complex
non-linear relationship in the data that this
non-parametric method does not capture; with K=1 the
classifier is also highly sensitive to noise and to
feature scaling
• The confusion matrix indicates that 63 of 190 actual
non-survivors, i.e. 33.16%, were predicted as
survivors (Type I error)
• Of the 77 survivors, 36 were missed by the
model, i.e. 46.75% (Type II error), which implies that
this is not a good model to use
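For readers unfamiliar with the method, a nearest-neighbour classifier simply copies the label of the closest training points. The sketch below is a minimal pure-Python version with Euclidean distance and made-up two-feature data; it is not the authors' R code, and the feature values are hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=1):
    """Predict the majority label among the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical, already-scaled features; label 1 = survived, 0 = did not.
train_x = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.2), (0.1, 0.8)]
train_y = [1, 0, 1, 0]

prediction_k1 = knn_predict(train_x, train_y, (0.8, 0.1), k=1)
prediction_k3 = knn_predict(train_x, train_y, (0.8, 0.1), k=3)
print(prediction_k1, prediction_k3)
```

With K=1 a single mislabeled or unusual neighbour decides the prediction, which is why larger odd values of K and scaled features are commonly tried.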
Result Interpretation - Decision Tree
● The decision tree highlights that, apart from
Sex and Pclass, the predictors Age, Fare, SibSp and
Embarked are also significant
● The tree also shows that females from
3rd class have a 21% survival rate
● The tree also shows that males older
than 6.5 years have an 80% non-survival rate
Result Interpretation - Random Forest
● The Mean Decrease Accuracy
measure indicates that Sex is the
most important predictor of
passenger survival, followed by
Pclass, Fare and Age
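Mean Decrease Accuracy is a permutation-based measure: shuffle one predictor, re-score the model, and record how much accuracy falls. The sketch below shows that idea in simplified form on toy data with a stand-in rule model; a random forest computes it per tree on out-of-bag samples, which this example does not reproduce, and all names and data here are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)

def toy_model(row):
    """Stand-in classifier that only looks at Sex (hypothetical)."""
    return 1 if row["Sex"] == "female" else 0

rows = [
    {"Sex": "female", "Fare": 80.0}, {"Sex": "male", "Fare": 7.3},
    {"Sex": "female", "Fare": 12.0}, {"Sex": "male", "Fare": 30.0},
    {"Sex": "male", "Fare": 8.1}, {"Sex": "female", "Fare": 26.0},
]
labels = [1, 0, 1, 0, 0, 1]

sex_importance = permutation_importance(toy_model, rows, labels, "Sex")
fare_importance = permutation_importance(toy_model, rows, labels, "Fare")
print(sex_importance, fare_importance)
```

A predictor the model never uses gets an importance of zero, because shuffling it cannot change any prediction.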
Result Interpretation - Association Analysis
● Based on the Association Analysis,
we found that 3rd class male
passengers who embarked at
Southampton and are in the age
group of 20-30 years are 46.6% more
likely not to survive than other
passengers.
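Association rules are usually summarized by support, confidence and lift, where a lift of 1.466 would read as "46.6% more likely" than the baseline. Whether the figure above is a lift is an assumption; the sketch below only shows how the three measures are computed, on made-up transactions.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_a if consequent <= t]
    has_c = [t for t in transactions if consequent <= t]
    support = len(has_both) / n
    confidence = len(has_both) / len(has_a)
    lift = confidence / (len(has_c) / n)
    return support, confidence, lift

# Hypothetical passengers encoded as sets of attribute=value items.
transactions = [
    {"Pclass=3", "Sex=male", "Survived=No"},
    {"Pclass=3", "Sex=male", "Survived=No"},
    {"Pclass=3", "Sex=male", "Survived=Yes"},
    {"Pclass=1", "Sex=female", "Survived=Yes"},
    {"Pclass=1", "Sex=male", "Survived=No"},
    {"Pclass=2", "Sex=female", "Survived=Yes"},
]

support, confidence, lift = rule_metrics(
    transactions, {"Pclass=3", "Sex=male"}, {"Survived=No"})
print(support, confidence, lift)
```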
Conclusion
• Comparing accuracy across the models, Random Forest
is the best, followed by GLM Model 3, in which Age
enters as a second-degree polynomial
• The Random Forest model highlights the importance of the predictors
Sex, Pclass, Fare and Age
• GLM Model 3 gives similar results, with Age given higher
weight, followed by Sex, Pclass and SibSp
• After analyzing all the models, we conclude that the predictors
Sex, Pclass and Age played a major role in who survived the Titanic

Titanic - Presentation
