INTRODUCTION TO DATA SCIENCE
Aug, 2018
Discussion Document
By Subrata Saharia
Agenda
1. Introduction to Data science
2. Data science: Applications
3. Case Study : Who in the Titanic would have survived?
Data Science
Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights from data in
various forms, both structured and unstructured (Wikipedia)
• The most important thing in data science is to define the problem statement
that we want to solve
• The second most important thing is the data that helps answer the
question/problem
• What are some of the interesting problems that
data helps solve today?
• What are some of the challenges that easy
accessibility of data has posed today?
Types of data science questions
In approximate order of difficulty:
• Descriptive : Description of a set of data (http://saubhagya.gov.in/)
• Exploratory : Find relationships that were not known from the data
• Inferential : Use a relatively small sample of data to say something about a
bigger problem
• Predictive : Use the data of some variables to predict values for another
variable
• Causal : To find out what happens to one variable when we make changes to
another variable
• Mechanistic : Understand the exact changes in variables that lead to changes
in other variables for individual objects
(Advanced topics; outside the scope of this discussion)
Exploratory Analytics
• Define the questions you want to get answered from the data
• Use visual techniques to extract insights, correlations and trends
[Figure: Sub-process SLAs. Boxplot for sub-process SLA across the stages Receive, Process 1, Process 2, Quality, Complete. Question: What should be the sub-process SLA?]
[Figure: Histogram of Avg. Process TAT. Histogram of E2E Process SLA; Median: 10, Mean: 10.6. Question: What should be the Process SLA that can be shared with clients?]
E2E TAT            Aggregate  Aug   Sep   Oct   Nov   Dec   Jan
% within 10 days   50%        37%   71%   25%   38%   64%   60%
Mean               10.6       11    9     13    12    9     10
90th Percentile    17         18    14    21    17    13    14
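The summary rows in the table above (mean, median, 90th percentile, share within 10 days) can be computed with a few lines of code. The following is a minimal sketch in plain Python; the turnaround times in `tat_days` are hypothetical values made up for illustration, not the data behind the slide, and the 10-day threshold is taken from the "% within 10 days" row.

```python
import statistics

# Hypothetical end-to-end turnaround times in days (illustrative only).
tat_days = [4, 6, 7, 8, 9, 10, 10, 11, 12, 13, 15, 18, 21]

def summarize_tat(values, sla_days=10):
    """Return the statistics typically used when choosing a process SLA."""
    ordered = sorted(values)
    # quantiles(n=10) returns the nine decile cut points; the last is the 90th percentile.
    p90 = statistics.quantiles(ordered, n=10, method="inclusive")[-1]
    share_within = sum(1 for v in ordered if v <= sla_days) / len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": p90,
        "share_within_sla": share_within,
    }

summary = summarize_tat(tat_days)
print(summary)
```

A large gap between the median and the 90th percentile suggests that an SLA quoted at the median would be missed often, which is one reason to look at the whole distribution before sharing a number with clients.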
Inferential Analytics
Statistical Inference is defined as the process of “Generating conclusions about a Population from a
noisy Sample”
• Inference is commonly the goal of statistical models
• Inference estimates a quantity as well as the uncertainty in that estimate
• Inference depends heavily on the population as well as on the sampling
Key concepts used in Statistical Inference:
• Measures of Central Tendency : Mean / Median / Mode
• Measures of Variability : Range / Std. deviation / Variance
• Frequency distributions : Uniform / Binomial / Normal / etc.
• Measures of Statistical Significance (Hypothesis Tests) : t-tests / ANOVA / etc.; goodness of fit : R-squared
Some examples where statistical inference is used for research:
• Does smoking cause cancer?
• Does a new treatment really reduce cancer progression?
• Does an ad campaign lead to increase in web traffic?
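To make "estimates a quantity as well as the uncertainty" concrete, here is a minimal sketch that estimates a population mean from a small sample and attaches an approximate 95% interval. The `visits` numbers are hypothetical, and the multiplier 1.96 assumes a normal approximation; for a sample this small a t-distribution quantile would give a somewhat wider interval.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return the sample mean and a normal-approximation interval around it."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, (mean - z * std_err, mean + z * std_err)

# Hypothetical daily web traffic, in thousands of visits (illustrative only).
visits = [12.1, 11.8, 12.6, 13.0, 12.3, 11.9, 12.7, 12.4]
estimate, (low, high) = mean_confidence_interval(visits)
print(f"mean = {estimate:.2f}, approx. 95% interval = ({low:.2f}, {high:.2f})")
```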
Inferential Analytics: The sample mean
Measures of central tendency
• Expected values characterize a distribution.
• The most useful expected value, the mean, characterizes the center of a density or mass function.
• Another expected value summary, the variance, characterizes how spread out a density is.
• Yet another expected value calculation is the skewness, which considers how much a density is
pulled toward high or low values.
• The sample or empirical mean estimates the population mean; the sample variance estimates the
population variance and the sample skewness estimates the population skewness
[Figures: density plots for dice rolls (distribution is centered around 3.5) and density plots for coin flips (distribution is centered around 0.5)]
• The sample mean is an estimate of the population mean.
• The more data that goes into the sample mean, the more concentrated its density function is around
the population mean.
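The concentration of the sample mean around the population mean can be checked by simulation. The sketch below rolls a fair die and compares how widely the sample mean varies for samples of 1, 10 and 100 rolls; the trial count and seed are arbitrary choices made for this illustration.

```python
import random
import statistics

def sample_means(sample_size, trials=1000, seed=0):
    """Simulate `trials` sample means, each from `sample_size` fair die rolls."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        rolls = [rng.randint(1, 6) for _ in range(sample_size)]
        means.append(sum(rolls) / sample_size)
    return means

# Spread (standard deviation) of the sample mean for growing sample sizes.
spread_by_size = {n: statistics.pstdev(sample_means(n)) for n in (1, 10, 100)}
center_at_100 = statistics.mean(sample_means(100))
print(spread_by_size, center_at_100)
```

The spread shrinks as the sample size grows, while the center stays near the population mean of 3.5.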
Predictive Analytics
The goal of Predictive analytics is to predict an outcome variable based on characteristic variables of an
object/ observation
• If X predicts Y, it does not mean that X causes Y
• Accurate prediction depends heavily on measuring the right variables
• Although prediction models vary in quality, simple models with more data often work reasonably well
Key steps in a basic predictive analysis
• What is the question that we want to solve?
• Procure the data and load the data into the platform
• Break the data into test and training sets : The training set is further broken down for cross validation
• Preprocess the data in the training set
• Design statistical/ predictive models to predict the outcome
• Train the model to predict using the training data and analyze results to further refine
• Test the final model accuracy on the test data
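The data-splitting step above can be sketched in plain Python as follows. The 25% test fraction, the seed and the choice of 5 folds are assumptions made for this illustration; the deck does not state which values were used.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

rows = list(range(100))                      # stand-in for 100 observations
train_rows, test_rows = train_test_split(rows)
folds = list(k_fold_indices(len(train_rows), k=5))
print(len(train_rows), len(test_rows), [len(v) for _, v in folds])
```

The test set is set aside once and touched only for the final accuracy check; model tuning uses the folds carved out of the training set.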
Some examples where predictive analytics is used:
• Who will win the World Cup, 2018?
  - Goldman Sachs has predicted that Brazil will win the World Cup
  - Brazil has an 18.5 percent probability of winning, with the French second best at 11.3 percent
  - Based on 200K statistical models, individual players’ performances and 1M simulations of the tournament
• Recommendations on Amazon’s page on likely items for you to order / likely books you might like
Data science and machine learning are widely used in
the healthcare industry
Wearable technology to monitor and prevent health problems
• 2TB data generated daily by the human body: heart rate, sleep patterns, stress levels, brain activity
• Huge opportunity in preventive healthcare: 600K annual heart attacks, 30% obese population
• Wearables monitor heart and breathing patterns and detect anomalies to predict possible disorders
Improving diagnostic accuracy
• Despite access to a huge amount of data, 12 million people are misdiagnosed annually in the US
• Deep learning startup Enlitic employs data science to increase the accuracy and efficiency of
diagnostics (X-rays/CT scans): 70 percent more accurate results, 50,000 times faster!
Patient care transformation into precision medicine
• An end to “one size fits all”: Precision medicine will open up opportunities for personalized treatment
• The core premise of precision medicine is that the integration of genetic data, medical data, and rich
dynamic phenotypic (biological/physiological) information will enable sophisticated patient
segmentation, revealing biologically distinct subgroups and pointing the way to targeted treatments
• Oncora Medical, a healthcare startup, uses historical data from multiple cancer treatment centers and
the patient’s individual EMR information to provide personalized treatment recommendations, depending
on the type of cancer, the patient’s previous health records and current condition
Amazon knows what and when you will order next
Anticipatory Shipping
Amazon’s newly filed patent for an “Anticipatory Shipping” model describes a system for delivering products to customers before they
place an order.
“Amazon says it may box and ship products it expects customers in a specific area will want—based on previous orders and
other factors… According to the patent, the packages could wait at the shippers’ hubs or be in transit in trucks assuming the
order is placed for the item, supplying the full address to the shipping company while the package is in transit.”
What do you think would be high level factors for the algorithm?
• Frequency of similar orders
• Category of orders: daily use / periodic use items
• Past demand of items from the geography
• Product search history
• Page visits
• Time spent on page
• Wish list items
• Order history
• Items left in the shopping cart
• Time mouse cursor hovers over an item
• Return history of customers
• Consumption time of order categories
The system would have many advantages, but it also raises practical issues
• Convenience is the big attraction of the system. Think of the feeling we get when we see that an Amazon package has
arrived at our door: it’s delightful and exciting. Those feelings are bound to amplify when we don’t know what’s in the box
• Delivery time may go down to hours or even minutes in cases where the prediction accuracy is high
• With anticipatory shipping, Amazon takes over the responsibility of mundane shopping and relieves us of the hassle
• Issues may arise in the form of customers refusing to accept the order. Anticipatory shipping plans on giving certain customers
discounts based on household price sensitivity, or even outright gifts, on products that customers received but don’t want
• Logistics companies may face uncertainty in this supply model, and faulty predictions may backfire with costly returns
Problem Description:
• The sinking of the RMS Titanic is one of the
most infamous shipwrecks in history. On April
15, 1912, during her maiden voyage, the Titanic
sank after colliding with an iceberg, killing 1502
out of 2224 passengers and crew.
• One of the reasons that the shipwreck led to
such loss of life was that there were not enough
lifeboats for the passengers and crew. Although
there was some element of luck involved in
surviving the sinking, some groups of people
were more likely to survive than others, such as
women, children, and the upper-class.
• The challenge is to analyze and predict
which categories of people were likely to
survive
The Titanic Survival Problem
Note: This problem is one of the open competitions
live on Kaggle.com
For more details visit:
https://www.kaggle.com/c/titanic
Load the data
• Load the relevant libraries for the tools that we intend to use for the analysis
• The dataset needs to be loaded into the platform using input commands relevant to the dataset
  - CSV, text, Excel, image files etc. have different input commands specific to the dataset type
• Conduct a basic exploration of the data fields by loading the first few lines to see the type of data. Read the data dictionary, if
available, to see definitions of the dataset variables
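As an illustration of the loading and first-look steps, the sketch below reads a CSV with Python's standard `csv` module, prints the first rows and counts empty fields per column. The rows in `SAMPLE_CSV` are toy records in the 12-column layout of the Titanic data; the names, tickets and values are placeholders, not real passenger records. The deck's own output appears to come from R, so this is a parallel sketch rather than the author's code.

```python
import csv
import io

# Toy rows in the Titanic column layout (placeholder values, not real records).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T-001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T-002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,26,0,0,T-003,7.92,,S
4,0,3,"Loe, Mr. Sam",male,,0,0,T-004,8.05,,Q
"""

def load_rows(handle):
    """Read every CSV record into a dict keyed by the header names."""
    return list(csv.DictReader(handle))

def count_missing(rows):
    """Count empty fields per column as a first data-quality check."""
    return {col: sum(1 for r in rows if r[col] == "") for col in rows[0].keys()}

rows = load_rows(io.StringIO(SAMPLE_CSV))
missing = count_missing(rows)
for row in rows[:3]:          # first few lines, to see the type of data
    print(row)
print(missing)
```

With a downloaded file, the same functions could be used as `with open("train.csv", newline="") as f: rows = load_rows(f)`, where the file name is an assumption about how the data was saved.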
Loading the data
Key things to check:
• Check if all observations are available or if
there are some empty/ NULL observations
• Check if there are some redundant /
duplicate variables
• Check if all variables are clearly defined in
the data dictionary
• Check format of the data
• Check the input and the output variables:
Here Survived is the output variable.
• Check which variables can be combined for
better model building : feature
re-engineering
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Explore the data
• Form a few hypotheses that we need to check from the data to have a better understanding of the data
Sample hypotheses:
1. Females had a better probability of survival
2. Upper-class passengers had a higher probability of survival
3. Children below 5 years of age had a high survival probability
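Each of these hypotheses amounts to comparing survival rates between groups. The sketch below shows one way to compute such rates; the `passengers` list is a small set of hypothetical records invented for illustration, so the printed rates say nothing about the real dataset.

```python
from collections import defaultdict

# Hypothetical records (illustrative only, not taken from the Titanic data).
passengers = [
    {"Sex": "female", "Pclass": 1, "Age": 38, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Age": 26, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Age": 4, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Age": 30, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Age": 54, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Age": 22, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Age": 2, "Survived": 1},
    {"Sex": "male", "Pclass": 2, "Age": 35, "Survived": 0},
]

def survival_rate_by(records, key):
    """Group records with `key` and return the share of survivors per group."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        survivors[group] += record["Survived"]
    return {group: survivors[group] / totals[group] for group in totals}

by_sex = survival_rate_by(passengers, lambda r: r["Sex"])
by_class = survival_rate_by(passengers, lambda r: r["Pclass"])
by_age_group = survival_rate_by(passengers, lambda r: "under 5" if r["Age"] < 5 else "5 and over")
print(by_sex, by_class, by_age_group)
```

On real data, records with a missing Age would need to be skipped or imputed before the age grouping is applied.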
Exploring the data
[Figures: charts numbered 1, 2 and 3, one per sample hypothesis]
Create new features
• To check some of the hypotheses, we sometimes need to combine existing features to create a new variable
Sample hypotheses:
4. Larger families had a higher survival probability
5. Passengers whose names carry specific titles had a higher survival probability
Feature Re-engineering: Creating new features
Hypothesis 4:
Family = SibSp + Parch + 1
Larger families (size > 4) and solo travelers mostly perished
Hypothesis 5:
Titles are extracted from the Name variable
Less frequent titles are grouped into Others
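The two derived features can be sketched as follows. The regular expression assumes names of the form "Surname, Title. Given names" (as in "Abbing, Mr. Anthony"), and the `min_count` cut-off for a "less frequent" title is a hypothetical choice; the deck does not say which threshold was used. The names in the example are placeholders.

```python
import re
from collections import Counter

# Title = text between the comma and the first period, e.g. "Mr" in "Abbing, Mr. Anthony".
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def family_size(sibsp, parch):
    """Family = SibSp + Parch + 1 (the passenger counts as one member)."""
    return sibsp + parch + 1

def extract_title(name):
    """Pull the title out of a name; fall back to 'Others' if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Others"

def simplify_titles(titles, min_count=2):
    """Group titles seen fewer than `min_count` times into 'Others'."""
    counts = Counter(titles)
    return [t if counts[t] >= min_count else "Others" for t in titles]

names = ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Loe, Mr. Sam",
         "Moe, Mrs. Eve", "Zoe, Miss. Kay", "Coe, Rev. Tom"]   # placeholder names
titles = [extract_title(n) for n in names]
simplified = simplify_titles(titles)
print(family_size(1, 0), titles, simplified)
```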
Algorithms for Predictive Model Building
Classification Trees
Results: Accuracy 78.3%
Node details:
• TitleSimplified = Mr (Accuracy: 61%)
• CabinGroup = A/C/F/G/T (Accuracy: 88%) - Survived
• CabinGroup = B/D/E (Accuracy: 48%)
• Age >= 44 (Accuracy: 75%) - Survived
• Age < 44 (Accuracy: 68%) - Died
• TitleSimplified = Master/Miss/etc.
• Family >= 3.5 (Accuracy: 80%) - Survived | Family < 3.5 (Accuracy: 20%) - Died
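Thresholds such as Family >= 3.5 arise because a classification tree tries cut points midway between observed values and keeps the one that best separates the classes. The sketch below scores candidate cut points with Gini impurity; the use of Gini is an assumption (the deck does not name the splitting criterion) and the rows are toy data invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p) for two classes."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, feature, threshold):
    """Size-weighted Gini impurity after splitting rows at `threshold`."""
    left = [r["Survived"] for r in rows if r[feature] < threshold]
    right = [r["Survived"] for r in rows if r[feature] >= threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(rows, feature):
    """Try midpoints between consecutive distinct values; return the purest split."""
    values = sorted({r[feature] for r in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_impurity(rows, feature, t))

# Toy data: (family size, survived) pairs, illustrative only.
toy_rows = [{"Family": f, "Survived": s}
            for f, s in [(1, 0), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]]
threshold = best_threshold(toy_rows, "Family")
print(threshold, split_impurity(toy_rows, "Family", threshold))
```

A full tree repeats this search over every feature at every node and recurses into the two resulting groups.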
What can we learn from the exploratory analysis?
• Confirm/reject hypotheses: the confirmed hypotheses will act as parameters to the model
• Discover new patterns, e.g. ‘Title’ seems to be a good predictor of survival
• Observe correlation among parameters: one of each correlated pair needs to be eliminated to reduce noise in the model
• Find the missing values in some parameters: find suitable methods to address the gaps, such as imputing the mean or median
• Find duplicate/redundant parameters: these need to be eliminated from the model
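As one example of addressing missing values, the sketch below fills gaps with the median of the observed values. The `ages` list reuses the first ten Age values shown in the data structure listing (22 38 26 35 35 NA 54 2 27 14), with `None` standing in for NA; median imputation is only one of several reasonable choices.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# First ten Age values from the listing; None marks the missing entry.
ages = [22, 38, 26, 35, 35, None, 54, 2, 27, 14]
filled = impute_median(ages)
print(filled)
```

The median is less sensitive than the mean to a few very old or very young passengers, which is why it is often preferred for skewed variables.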
Predictive Model Building (1/2)
Algorithms for Predictive Model Building
Random forests
Random forests provide an improvement over decision trees: each time a split in a tree is considered, a random sample of m
predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m
predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p
Results : Accuracy : 80%
No. of variables tried at each split = 2
OOB error estimate ≈ 16%
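The per-split sampling of m ≈ √p predictors can be sketched as below. The predictor names are a hypothetical feature set chosen for illustration (the deck does not list the model's inputs), and taking the integer part of √p is one common convention for classification forests.

```python
import math
import random

def candidate_predictors(predictors, rng):
    """Draw the random subset of predictors considered at one split."""
    m = max(1, math.isqrt(len(predictors)))   # integer part of sqrt(p)
    return rng.sample(predictors, m)

# Hypothetical feature set (p = 6), so m = 2 predictors are tried per split.
predictors = ["Pclass", "Sex", "Age", "Fare", "Family", "Title"]
rng = random.Random(0)
split_candidates = [candidate_predictors(predictors, rng) for _ in range(3)]
print(split_candidates)
```

Because each split sees only a few predictors, a single strong predictor cannot dominate every tree, which decorrelates the trees and makes their average more stable.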
Predictive Model Building (2/2)
Confusion Matrix (rows: actual class, columns: predicted class)
            Dead   Survived   Error
Dead        366    45         ~11%
Survived    65     192        ~25%
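The error figures follow directly from the four counts in the confusion matrix. The short sketch below recomputes them, assuming that rows hold the actual class and columns the predicted class.

```python
# Counts from the confusion matrix: confusion[actual][predicted].
confusion = {
    "Dead": {"Dead": 366, "Survived": 45},
    "Survived": {"Dead": 65, "Survived": 192},
}

def class_error(matrix, actual):
    """Share of one actual class that the model assigned to the wrong class."""
    row = matrix[actual]
    return 1 - row[actual] / sum(row.values())

def overall_error(matrix):
    """Share of all observations that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return 1 - correct / total

dead_error = class_error(confusion, "Dead")          # 45 / 411
survived_error = class_error(confusion, "Survived")  # 65 / 257
total_error = overall_error(confusion)               # 110 / 668
print(f"{dead_error:.1%} {survived_error:.1%} {total_error:.1%}")
```

The overall figure, 110 misclassified out of 668, is consistent with the OOB error estimate of about 16%; the model is noticeably better at identifying passengers who died than those who survived.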
Ensemble models for ‘Majority Vote’
Compare results of various algorithms
Algorithm                          Accuracy
Decision trees                     78%
Random Forest                      80%
Support Vector Machine (linear)    75%
A majority voting mechanism is used to arrive at the final score and prediction:
Result : Accuracy 79%
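Majority voting simply takes, for each passenger, the label predicted by most of the models. A minimal sketch follows; the three prediction lists are made-up values for five passengers, not output from the models above.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per observation."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions (1 = survived, 0 = died) for five passengers.
tree_pred = [1, 0, 1, 0, 1]
forest_pred = [1, 0, 0, 0, 1]
svm_pred = [0, 0, 1, 1, 1]
final_pred = majority_vote([tree_pred, forest_pred, svm_pred])
print(final_pred)
```

Using an odd number of models avoids ties for a two-class problem. As the accuracy table shows, a vote does not necessarily beat the best single model.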
Final Comments
• More models, such as gradient boosting and logistic regression, can be added to the ensemble, and the best
combination can be found through iteration
Coursera: Data Science Specialization
https://www.coursera.org/specializations/jhu-data-science
Online pdfs:
https://leanpub.com/rprogramming
https://leanpub.com/LittleInferenceBook
https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
Online datasets:
https://www.kaggle.com/datasets
https://www.dataquest.io/blog/free-datasets-for-projects/
Data Scientist resources
Thanks!

INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Último (20)

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Here are the steps to load and explore the Titanic dataset:
1. Import the necessary libraries, such as pandas and numpy, for data manipulation and analysis.
2. Load the CSV file containing the passenger data into a DataFrame.
3. Check the first few rows of data to get a sense of the columns and data types.
4. Print the shape of the DataFrame to see the number of rows and columns.
5. Check for null values using .isnull() and .sum().
6. Print the data type of each column to check that the types are appropriate.
7. Print a description of the dataset, which includes count, mean, std, min, 25%, 50%, 75% and max for each column.

  • 1. INTRODUCTION TO DATA SCIENCE Aug, 2018 Discussion Document By Subrata Saharia
  • 2. 1 Agenda 1. Introduction to Data science 2. Data science: Applications 3. Case Study : Who in the Titanic would have survived?
  • 3. 2 Data Science Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured ( Wikipedia) • The most important thing in data science is to define ‘ The Problem Statement’ that we want to solve • The second most important thing is the data to help answer the question/problem • What are some of the interesting problems that data helps solve today? • What are some of the challenges that easy accessibility of data has posed today? 1
  • 4. 3 Types of data science questions In approximate order of difficulty: • Descriptive : Description of a set of data (http://saubhagya.gov.in/) • Exploratory : Find relationships that were not known from the data • Inferential : Use a relatively small sample of data to say something about a bigger problem • Predictive : Use the data of some variables to predict values for another variable • Causal : To find out what happens to one variable when we make changes to another variable • Mechanistic : Understand the exact changes in variables that lead to changes in other variables for individual objects Advanced topics; Outside scope of this discussion 1
  • 5. 4 Exploratory Analytics • Define the questions you want to get answered from the data • Use visual techniques to extract insights, correlations and trends
    [Figure panels: "Sub-process SLAs" (boxplot for sub-process SLA): What should be the sub-process SLA? "Histogram of Avg. Process TAT" (histogram of E2E process SLA; median: 10, mean: 10.6): What should be the process SLA that can be shared with clients?]
    Process stages: Receive, Process 1, Process 2, Quality, Complete
    E2E TAT           Aggregate  Aug  Sep  Oct  Nov  Dec  Jan
    % within 10 days  50%        37%  71%  25%  38%  64%  60%
    Mean              10.6       11   9    13   12   9    10
    90th percentile   17         18   14   21   17   13   14
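The summary rows in the table above (mean, median, 90th percentile, share within 10 days) can be sketched in a few lines of Python. This is a minimal illustration, not the author's method: the function name `summarize_tat`, the nearest-rank percentile rule and the sample values are all assumptions made for the example.

```python
import statistics

def summarize_tat(tat_days, sla_days=10):
    """Summarize turnaround times (in days) against a candidate SLA."""
    ordered = sorted(tat_days)
    # Nearest-rank 90th percentile: ceil(0.9 * n), computed with integers.
    rank = (9 * len(ordered) + 9) // 10
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
        "share_within_sla": sum(d <= sla_days for d in ordered) / len(ordered),
    }

# Hypothetical turnaround times, invented for illustration only.
sample_tat = [4, 6, 7, 8, 9, 10, 11, 12, 15, 18]
summary = summarize_tat(sample_tat)
print(summary)
```

Comparing the mean with the 90th percentile is one way to see why an SLA set at the average would be missed in a large share of cases.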
  • 6. 5 Inferential Analytics Statistical inference is defined as the process of “generating conclusions about a population from a noisy sample” • Inference is commonly the goal of statistical models • Inference estimates a quantity as well as the uncertainty in the estimate • Inference depends heavily on the population as well as on the sampling Key concepts used in statistical inference: • Measures of central tendency: Mean / Median / Mode • Measures of variability: Range / Std. deviation / Variance • Frequency distributions: Uniform / Binomial / Normal / etc. • Measures of statistical significance (hypothesis tests): R-square / T-tests / ANOVA / etc. Some examples where statistical inference is used for research: • Does smoking cause cancer? • Does a new treatment really reduce cancer progression? • Does an ad campaign lead to an increase in web traffic?
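The point that inference estimates both a quantity and its uncertainty can be illustrated with a sample mean and an approximate confidence interval. The sketch below uses a normal approximation (for small samples a t quantile would be more appropriate); the function name and the sample values are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_with_interval(sample, confidence=0.95):
    """Return the sample mean and a normal-approximation confidence interval."""
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, (mean - z * std_err, mean + z * std_err)

# Hypothetical measurements, invented for illustration only.
sample = [12, 9, 11, 10, 13, 8, 10, 11, 9, 12]
mean, (low, high) = mean_with_interval(sample)
print(f"mean={mean:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

Asking for more confidence widens the interval, and a larger sample narrows it.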
  • 7. 6 Inferential Analytics: The sample mean Measures of central tendency • Expected values characterize a distribution. • The most useful expected value, the mean, characterizes the center of a density or mass function. • Another expected value summary, the variance, characterizes how spread out a density is. • Yet another expected value calculation is the skewness, which considers how much a density is pulled toward high or low values. • The sample or empirical mean estimates the population mean; the sample variance estimates the population variance and the sample skewness estimates the population skewness. [Figures: density plots for dice rolls (distribution is centered around 3.5) and density plots for coin flips (distribution is centered around 0.5)] • The sample mean is an estimate of the population mean. • The more data that goes into the sample mean, the more concentrated its density function is around the population mean.
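The last point, that the sample mean concentrates around the population mean as more data goes into it, can be checked with a small simulation of fair dice rolls (population mean 3.5). This is a sketch under stated assumptions: the function name, the number of experiments and the seed are arbitrary choices for the example.

```python
import random
import statistics

def sample_means(n_rolls, n_experiments, rng):
    """Average n_rolls fair dice rolls, repeated n_experiments times."""
    return [
        statistics.mean([rng.randint(1, 6) for _ in range(n_rolls)])
        for _ in range(n_experiments)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
for n_rolls in (1, 10, 100):
    means = sample_means(n_rolls, 2000, rng)
    center = statistics.mean(means)
    spread = statistics.pstdev(means)
    print(f"rolls per mean={n_rolls:3d}  center={center:.2f}  spread={spread:.2f}")
```

The center stays near 3.5 in every case, while the spread of the sample means shrinks as the number of rolls per mean grows.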
  • 8. 7 Predictive Analytics The goal of predictive analytics is to predict an outcome variable based on characteristic variables of an object/observation • If X predicts Y, it does not mean that X causes Y • Accurate prediction depends heavily on measuring the right variables • Although there are good and bad prediction models, more data and simple models work reasonably well Key steps in a basic predictive analysis • What is the question that we want to solve? • Procure the data and load the data into the platform • Break the data into test and training sets: the training set is further broken down for cross validation • Preprocess the data in the training set • Design statistical/predictive models to predict the outcome • Train the model to predict using the training data and analyze results to further refine • Test the final model accuracy on the test data Some examples where predictive analytics is used: • Who will win the World Cup, 2018? - Goldman Sachs has predicted that Brazil will win the World Cup - Brazil has an 18.5 percent probability of winning, with the French second best at 11.3 percent - 200K statistical models, individual players’ performances and 1M simulations of the tournament • Recommendations on Amazon’s page on likely items for you to order / likely books you might like
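The step "break the data into test and training sets, then break the training set down for cross validation" can be sketched without any modelling library. The helper names, the 80/20 split, the five folds and the seed below are assumptions for illustration; the row count of 891 is borrowed from the Titanic training data described later in the deck.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, validation_idx) pairs; each row is validated once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

rows = list(range(891))  # stand-in row ids
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
for fold_no, (fit_idx, val_idx) in enumerate(k_fold_indices(len(train)), start=1):
    print(f"fold {fold_no}: fit on {len(fit_idx)}, validate on {len(val_idx)}")
```

The test rows are set aside once and only used for the final accuracy check; model tuning happens on the folds inside the training set.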
  • 9. 8 Agenda 1. Introduction to Data science 2. Data science: Applications 3. Case Study : Who in the Titanic would have survived?
  • 10. 9 Data science and machine learning are widely used in the healthcare industry Wearable technology to monitor and prevent health problems • 2 TB of data is generated daily by the human body: heart rate, sleep patterns, stress levels, brain activity • Huge opportunity in preventive healthcare: 600K annual heart attacks, 30% obese population • Wearables monitor heart and breathing patterns, and anomalies are detected to predict possible disorders Improving diagnostic accuracy • Despite access to a huge amount of data, 12 million people are misdiagnosed annually in the US • Deep learning startup Enlitic employs data science to increase the accuracy and efficiency of diagnostics (X-rays/CT scans): 70 percent more accurate results, 50,000 times faster! Patient care transformation into precision medicine • An end to “one size fits all”: precision medicine will open up opportunities for personalized treatment • The core premise of precision medicine is that the integration of genetic data, medical data, and rich dynamic phenotypic (biological/physiological) information will enable sophisticated patient segmentation, revealing biologically distinct subgroups and pointing the way to targeted treatments • Oncora Medical, a healthcare startup, uses historic data from multiple cancer treatment centers and the patient’s individual EMR information to provide personalized treatment recommendations, depending on the type of cancer, the patient’s previous health records and the patient’s current condition
  • 11. 10 Amazon knows what and when you will order next Anticipatory Shipping Amazon’s newly filed patent, the “Anticipatory Shipping” model, is a system for delivering products to customers before they place an order. “Amazon says it may box and ship products it expects customers in a specific area will want—based on previous orders and other factors… According to the patent, the packages could wait at the shippers’ hubs or be in transit in trucks assuming the order is placed for the item, supplying the full address to the shipping company while the package is in transit.” What do you think would be the high-level factors for the algorithm? Frequency of similar orders; Category of orders: daily use / periodic use items; Past demand for items from the geography; Product search history; Page visits; Time spent on page; Wish list items; Order history; Items left in the shopping cart; Time the mouse cursor hovers over an item; Return history of customers; Consumption time of order categories. The system would have many advantages, but it also has many practical issues • Convenience is the big attraction of the system. Think of the feeling we get when we see that an Amazon package has arrived at our door: it’s delightful and exciting. Those feelings are bound to amplify when we don’t know what’s in the box • Delivery time may go down to hours or even minutes in cases where the prediction accuracy is high • With anticipatory shipping, Amazon is taking over the responsibility of mundane shopping and relieving us of the hassle • Issues may arise in the form of customers refusing to accept the order. Anticipatory shipping plans on giving certain customers discounts based on household price sensitivity, or even outright gifts, on products that customers received but don’t want • Logistics companies may face uncertainty in this supply model, and faulty predictions may backfire with costly returns
  • 12. 11 Agenda 1. Introduction to Data science 2. Data science: Applications 3. Case Study : Who in the Titanic would have survived?
  • 13. Problem Description: • The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. • One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. • The challenge is to analyze and predict which categories of people were likely to survive 12 The Titanic Survival Problem Note: This problem is one of the open competitions live on Kaggle.com. For more details visit: https://www.kaggle.com/c/titanic
  • 14. Load the data • Load the relevant libraries for the tools that we intend to use for the analysis • The dataset needs to be loaded into the platform using input commands relevant to the dataset - CSV, text, Excel, image files etc. have different input commands specific to the dataset type • Conduct basic exploration of the data fields by loading the first few lines to see the type of data. Read the data dictionary, if available, to see definitions of the dataset variables 13 Loading the data Key things to check: • Check if all observations are available or if there are some empty/NULL observations • Check if there are some redundant/duplicate variables • Check if all variables are clearly defined in the data dictionary • Check the format of the data • Check the input and the output variables: here Survived is the output variable • Check which variables can be combined for better model building: feature re-engineering
    'data.frame': 891 obs. of 12 variables:
    $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
    $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
    $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
    $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191
    $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
    $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
    $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
    $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
    $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
    $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
    $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56
    $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
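The loading and "key things to check" steps can be sketched with Python's standard `csv` module. This is only an illustration, not the tooling used in the deck (whose output above comes from R): the inline sample holds the first six passengers decoded from the structure listing, restricted to seven columns, and the helper names are hypothetical. A real run would open the downloaded CSV file instead of the inline string.

```python
import csv
import io

# First six passengers, decoded from the structure listing above;
# the empty Age field stands for the NA value.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch
1,0,3,male,22,1,0
2,1,1,female,38,1,0
3,1,3,female,26,0,0
4,1,1,female,35,1,0
5,0,3,male,35,0,0
6,0,3,male,,0,0
"""

def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

def missing_counts(rows):
    """Count empty fields per column."""
    return {col: sum(1 for row in rows if row[col] == "") for col in rows[0]}

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows,", len(rows[0]), "columns")
print(rows[:2])              # first few records, to eyeball the data
print(missing_counts(rows))  # which variables have gaps
```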
  • 15. Explore the data • Form a few hypotheses to check against the data, to gain a better understanding of it Sample hypotheses: 1. Females had a better probability of survival 2. Upper-class passengers had a higher probability of survival 3. Children below 5 years of age had a high survival probability 14 Exploring the data [Charts 1, 2 and 3 correspond to the hypotheses above]
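Checking hypotheses like these amounts to comparing survival rates across groups. The sketch below uses only the first ten passengers, decoded from the structure listing on the "Loading the data" slide, so it shows the mechanics rather than any conclusion; the function name and the age grouping labels are assumptions.

```python
from collections import defaultdict

# (Survived, Pclass, Sex, Age) for the first ten passengers on the
# "Loading the data" slide; None marks the missing age.
PASSENGERS = [
    (0, 3, "male", 22), (1, 1, "female", 38), (1, 3, "female", 26),
    (1, 1, "female", 35), (0, 3, "male", 35), (0, 3, "male", None),
    (0, 1, "male", 54), (0, 3, "male", 2), (1, 3, "female", 27),
    (1, 2, "female", 14),
]

def survival_rate_by(passengers, key):
    """Group passengers by key(passenger) and return the survival rate per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [survivors, count]
    for passenger in passengers:
        group = key(passenger)
        if group is None:  # skip rows where the grouping value is missing
            continue
        totals[group][0] += passenger[0]
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

def age_band(passenger):
    age = passenger[3]
    if age is None:
        return None
    return "under 5" if age < 5 else "5 and over"

by_sex = survival_rate_by(PASSENGERS, key=lambda p: p[2])
by_class = survival_rate_by(PASSENGERS, key=lambda p: p[1])
by_age = survival_rate_by(PASSENGERS, key=age_band)
print(by_sex, by_class, by_age, sep="\n")
```

Ten rows are far too few to confirm or reject a hypothesis; the same grouping applied to all 891 observations is what the charts on this slide summarize.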
  • 16. Create new features • To check some of the hypotheses, we sometimes need to combine existing features to create a new variable Sample hypotheses: 4. Larger families had a higher survival probability 5. Some names with specific titles have a higher survival probability 15 Feature Re-engineering: Creating new features [Chart 4] Family = SibSp + Parch + 1; larger families (> 4) and solo travelers mostly perished [Chart 5] The titles are extracted from the Name variable; less frequent titles are grouped into Others
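Both derived features can be sketched briefly. The family size follows the formula on the slide; the title extraction assumes names follow the "Surname, Title. Given names" pattern seen in "Abbing, Mr. Anthony". The set of titles treated as common, the function names and the sample names are illustrative assumptions, not the author's exact grouping.

```python
import re

# Assumption: these four titles are kept; everything else becomes "Others".
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}

def family_size(sibsp, parch):
    """Family = SibSp + Parch + 1 (the passenger counts as one)."""
    return sibsp + parch + 1

def simplified_title(name):
    """Extract the title between the comma and the first period of a name."""
    match = re.search(r",\s*([^.]+)\.", name)
    title = match.group(1).strip() if match else ""
    return title if title in COMMON_TITLES else "Others"

# Illustrative names in the assumed format.
for name in ["Braund, Mr. Owen Harris",
             "Palsson, Master. Gosta Leonard",
             "Uruchurtu, Don. Manuel E"]:
    print(name, "->", simplified_title(name))
print("family size for SibSp=3, Parch=1:", family_size(3, 1))
```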
  • 17. Algorithms for Predictive Model Building Classification Trees Results: Accuracy 78.3% Node details: • TitleSimplified = Mr (Accuracy: 61%) • CabinGroup = A/C/F/G/T (Accuracy: 88%) - Survived • CabinGroup = B/D/E (Accuracy: 48%) • Age >= 44 (Accuracy: 75%) - Survived • Age < 44 (Accuracy: 68%) - Died • TitleSimplified = Master/Miss/etc. • Family >= 3.5 (Accuracy: 80%) - Survived | Family < 3.5 (Accuracy: 20%) - Died What can we learn from the exploratory analysis? • Confirm/reject hypotheses: the confirmed hypotheses will act as parameters to the model • Discover new patterns: e.g. ‘Title’ seems to be a good predictor for survival • Observe correlation among parameters: one of the two correlated parameters needs to be eliminated to reduce noise in the model • Find the missing values in some parameters: find suitable methods to address the gaps through averages/median etc. • Find duplicate/redundant parameters: these need to be eliminated from the model 16 Predictive Model Building (1/2)
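Addressing missing values "through averages/median" can be sketched with median imputation on the Age variable. The ten ages below are the ones shown in the structure listing on the "Loading the data" slide; the function name is hypothetical, and median imputation is only one of the options the slide alludes to.

```python
import statistics

# Ages of the first ten passengers from the "Loading the data" slide;
# None stands for the NA value.
AGES = [22, 38, 26, 35, 35, None, 54, 2, 27, 14]

def impute_with_median(values):
    """Replace missing entries with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = statistics.median(known)
    return [fill if v is None else v for v in values]

filled = impute_with_median(AGES)
print(filled)
```

The median is less sensitive than the mean to a few very old or very young passengers, which is one reason it is a common default for a skewed variable.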
  • 18. Algorithms for Predictive Model Building Random forests Random forests provide an improvement over decision trees: each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p Results: Accuracy: 80% No. of variables tried at each split = 2 OOB error estimate = ~16% 17 Predictive Model Building (2/2)
    Confusion matrix:
                Dead  Survived  Error
    Dead        366   45        ~11%
    Survived    65    192       ~25%
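The error figures on this slide follow directly from the confusion matrix, and can be re-derived in a few lines. The sketch assumes rows are actual classes and columns are predicted classes; the function names are hypothetical, and rounding √p down is one possible reading of m ≈ √p.

```python
import math

# Confusion matrix from the slide; assumed layout: actual -> predicted -> count.
CONFUSION = {
    "Dead": {"Dead": 366, "Survived": 45},
    "Survived": {"Dead": 65, "Survived": 192},
}

def class_errors(confusion):
    """Share of each actual class that was misclassified."""
    return {actual: 1 - row[actual] / sum(row.values())
            for actual, row in confusion.items()}

def overall_error(confusion):
    """Share of all observations that were misclassified."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row[actual] for actual, row in confusion.items())
    return 1 - correct / total

def predictors_per_split(p):
    """m is roughly sqrt(p); rounding down is an assumption of this sketch."""
    return max(1, math.isqrt(p))

print({k: round(v, 3) for k, v in class_errors(CONFUSION).items()})
print(round(overall_error(CONFUSION), 3))
print(predictors_per_split(4))
```

The per-class errors (45 of 411 and 65 of 257) and the overall error (110 of 668) match the ~11%, ~25% and ~16% reported above.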
  • 19. 18 Ensemble models for ‘Majority Vote’ Compare results of various algorithms:
    Algorithm                        Accuracy
    Decision trees                   78%
    Random forest                    80%
    Support vector machine (linear)  75%
    A majority voting mechanism is used to arrive at the final score and prediction. Result: Accuracy 79% Final Comments • More models, such as gradient boosting and logistic regression, can be added to the ensemble, and the best combination found through iteration
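A majority vote over several models' predictions can be sketched as follows. The per-model predictions are invented for illustration (1 = survived, 0 = died) and the function names are hypothetical; with three models and two classes a tie cannot occur.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by the most models for one passenger."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(model_predictions):
    """Combine per-model prediction lists into one list by majority vote."""
    return [majority_vote(votes) for votes in zip(*model_predictions)]

# Hypothetical predictions for five passengers from three models.
tree_pred = [0, 1, 1, 0, 1]
forest_pred = [0, 1, 0, 0, 1]
svm_pred = [1, 1, 0, 0, 0]

final = ensemble_predict([tree_pred, forest_pred, svm_pred])
print(final)
```

Adding a further model to the ensemble only means passing one more prediction list; an odd number of models keeps binary votes free of ties.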
  • 20. 19 Data Scientist resources
    Coursera: Data Science Specialization
      https://www.coursera.org/specializations/jhu-data-science
    Online pdfs:
      https://leanpub.com/rprogramming
      https://leanpub.com/LittleInferenceBook
      https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
    Online datasets:
      https://www.kaggle.com/datasets
      https://www.dataquest.io/blog/free-datasets-for-projects/