Here are the steps to load and explore the Titanic dataset:
1. Import necessary libraries like pandas and numpy for data manipulation and analysis
2. Load the CSV file containing the passenger data into a DataFrame
3. Check the first few rows of data to get a sense of columns and data types
4. Print the shape of the DataFrame to know the number of rows and columns
5. Check for any null values present using .isnull() and .sum()
6. Print the data types of each column to check for appropriate types
7. Print a description of the dataset, which includes count, mean, std, min, 25%, 50%, 75%, and max for each column
Agenda
1. Introduction to Data science
2. Data science: Applications
3. Case Study: Who on the Titanic would have survived?
Data Science
Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights from data in
various forms, both structured and unstructured (Wikipedia)
• The most important thing in data science is to define 'The Problem Statement'
that we want to solve
• The second most important thing is the data to help answer the
question/problem
• What are some of the interesting problems that data helps solve today?
• What are some of the challenges that easy accessibility of data has posed today?
Types of data science questions
In approximate order of difficulty:
• Descriptive: Description of a set of data (http://saubhagya.gov.in/)
• Exploratory: Find relationships in the data that were not previously known
• Inferential: Use a relatively small sample of data to say something about a bigger population
• Predictive: Use the data of some variables to predict values for another variable
• Causal: Find out what happens to one variable when we make changes to another variable
• Mechanistic: Understand the exact changes in variables that lead to changes in other variables for individual objects (advanced topic; outside the scope of this discussion)
Exploratory Analytics
• Define the questions you want to get answered from the data
• Use visual techniques to extract insights, correlations and trends
Sub-process SLAs: What should be the sub-process SLA? What should be the Process SLA that can be shared with clients?

Sub-process chain: Receive -> Process 1 -> Process 2 -> Quality -> Complete

[Figures: Histogram of Avg. Process TAT (Median: 10, Mean: 10.6); Histogram of E2E Process SLA; Boxplot for sub-process SLA]

E2E TAT:
                  Aggregate  Aug  Sep  Oct  Nov  Dec  Jan
% within 10 days        50%  37%  71%  25%  38%  64%  60%
Mean                   10.6   11    9   13   12    9   10
90th Percentile          17   18   14   21   17   13   14
Inferential Analytics
Statistical Inference is defined as the process of “Generating conclusions about a Population from a
noisy Sample”
• Inference is commonly the goal of statistical models
• Inference estimates quantity as well as uncertainty in the estimate
• Inference depends heavily on the population as well as on the sampling
Key concepts used in Statistical Inference:
• Measures of Central Tendency : Mean / Median / Mode
• Measures of Variability : Range / Std. deviation / Variance
• Frequency distributions : Uniform / Binomial / Normal / etc.
• Measures of Statistical Significance (Hypothesis Tests): R-squared / t-tests / ANOVA / etc.
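These measures are straightforward to compute; as an illustrative sketch using Python's standard statistics module on a hypothetical sample of turnaround times (the values are borrowed from the monthly means on the earlier slide, not real measurements):

```python
import statistics

# Hypothetical sample: monthly mean turnaround times (days)
tat = [11, 9, 13, 12, 9, 10]

mean = statistics.mean(tat)          # central tendency
median = statistics.median(tat)      # -> 10.5
mode = statistics.mode(tat)          # most frequent value -> 9
value_range = max(tat) - min(tat)    # variability -> 4
stdev = statistics.stdev(tat)        # sample standard deviation
variance = statistics.variance(tat)  # stdev squared
```
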
Some examples where statistical inference is used for research:
• Does smoking cause cancer?
• Does a new treatment really reduce cancer progression?
• Does an ad campaign lead to an increase in web traffic?
Inferential Analytics: The sample mean
Measures of central tendency
• Expected values characterize a distribution.
• The most useful expected value, the mean, characterizes the center of a density or mass function.
• Another expected value summary, the variance, characterizes how spread out a density is.
• Yet another expected value calculation is the skewness, which considers how much a density is
pulled toward high or low values.
• The sample or empirical mean estimates the population mean; the sample variance estimates the
population variance and the sample skewness estimates the population skewness
1
[Figures: density plots for dice rolls, centered around 3.5, and coin flips, centered around 0.5]
• The sample mean is an estimate of the population mean.
• The more data that goes into the sample mean, the more concentrated its density function is around the population mean.
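This concentration effect can be simulated; a minimal sketch with Python's standard library, assuming fair six-sided dice:

```python
import random

random.seed(0)

def mean_of_rolls(n):
    """Average of n fair six-sided die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

def spread(xs):
    """Standard deviation of a list of sample means."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# 1000 sample means at two sample sizes: both centre near 3.5,
# but the larger samples give a much more concentrated distribution
small = [mean_of_rolls(5) for _ in range(1000)]
large = [mean_of_rolls(100) for _ in range(1000)]
```

Here spread(large) comes out several times smaller than spread(small), matching the density plots described on the slide.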
Predictive Analytics
The goal of Predictive analytics is to predict an outcome variable based on characteristic variables of an
object/ observation
• If X predicts Y, it does not mean that X causes Y
• Accurate prediction depends heavily on measuring the right variables
• Although there are good and bad prediction models, more data and simple models tend to work reasonably well
Key steps in a basic predictive analysis
• What is the question that we want to solve?
• Procure the data and load the data into the platform
• Break the data into training and test sets: the training set is further broken down for cross-validation
• Preprocess the data in the training set
• Design statistical/ predictive models to predict the outcome
• Train the model to predict using the training data and analyze results to further refine
• Test the final model accuracy on the test data
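The split step above can be sketched with the standard library as a simple random 80/20 split (in practice a library helper such as scikit-learn's train_test_split is typically used):

```python
import random

def split_data(rows, test_frac=0.2, seed=42):
    """Shuffle row indices and split the data into training and test sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(len(rows) * (1 - test_frac))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

train_rows, test_rows = split_data(list(range(100)))
# 80 rows to train (and cross-validate) on, 20 held out for the final test
```
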
Some examples where predictive analytics is used:
• Who will win the World Cup 2018? Goldman Sachs has predicted that Brazil will win the World Cup:
Brazil has an 18.5 percent probability of winning, with France second at 11.3 percent.
The prediction used 200K statistical models, individual players' performances, and 1M simulations of the tournament
• Recommendations on Amazon's page for items you are likely to order / books you might like
Agenda
1. Introduction to Data science
2. Data science: Applications
3. Case Study: Who on the Titanic would have survived?
Data science and machine learning are widely used
in the healthcare industry
Wearable technology to monitor and prevent health problems
• 2TB data generated daily by the human body: heart rate, sleep patterns, stress levels, brain activity
• Huge opportunity in preventive healthcare: 600K annual heart attacks, 30% obese population
• Heart/breathing patterns are monitored and anomalies are detected to predict possible disorders
Improving diagnostic accuracy
• Despite access to huge amounts of data, 12 million people are misdiagnosed annually in the US
• Enlitic, a deep learning startup, employs data science to increase the accuracy and efficiency of
diagnostics (X-rays/CT scans): 70 percent more accurate results, 50,000 times faster!
Patient care transformation into precision medicine
• An end to "one size fits all": precision medicine will open up opportunities for personalized treatment
• The core premise of precision medicine is that the integration of genetic data, medical data, and rich
dynamic phenotypic (biological/physiological) information will enable sophisticated patient
segmentation, revealing biologically distinct subgroups and pointing the way to targeted treatments
• Oncora Medical, a healthcare startup, uses historic data from multiple cancer treatment centers and
a patient's individual EMR information to provide personalized treatment recommendations, depending
on the type of cancer, the patient's previous health records, and their current condition
Amazon knows what and when you will order next
Anticipatory Shipping
Amazon's newly filed patent, "Anticipatory Shipping", describes a system for delivering products to customers before they
place an order.
“Amazon says it may box and ship products it expects customers in a specific area will want—based on previous orders and
other factors… According to the patent, the packages could wait at the shippers’ hubs or be in transit in trucks assuming the
order is placed for the item, supplying the full address to the shipping company while the package is in transit.”
What do you think would be high-level factors for the algorithm?
• Frequency of similar orders
• Category of orders: daily use / periodic use items
• Past demand for items from the geography
• Product search history
• Page visits
• Time spent on a page
• Wish list items
• Order history
• Items left in the shopping cart
• Time the mouse cursor hovers over an item
• Return history of customers
• Consumption time of order categories
Advantages of this system would be many, but it has several practical issues:
• Convenience is the big attraction of the system. Think of the feeling we get when we see that an Amazon package has
arrived at our door; it's delightful and exciting. Those feelings are bound to amplify when we don't know what's in the box
• Delivery time may go down to hours or even minutes in cases where the prediction accuracy is high
• With anticipatory shipping, Amazon is taking over the responsibility of mundane shopping and relieving us of the hassle
• Issues may arise in the form of customers refusing to accept the order. Anticipatory shipping plans to give certain customers
discounts based on household price sensitivity, or even outright gifts, for products that customers received but don't want
• Logistics companies may face uncertainty in this supply model, and faulty predictions may backfire with costly returns
Agenda
1. Introduction to Data science
2. Data science: Applications
3. Case Study: Who on the Titanic would have survived?
Problem Description:
• The sinking of the RMS Titanic is one of the
most infamous shipwrecks in history. On April
15, 1912, during her maiden voyage, the Titanic
sank after colliding with an iceberg, killing 1502
out of 2224 passengers and crew.
• One of the reasons that the shipwreck led to
such loss of life was that there were not enough
lifeboats for the passengers and crew. Although
there was some element of luck involved in
surviving the sinking, some groups of people
were more likely to survive than others, such as
women, children, and the upper-class.
• The challenge is to analyze and predict
which categories of people were likely to
survive
The Titanic Survival Problem
Note: This problem is one of the open competitions
live in Kaggle.com
For more details visit:
https://www.kaggle.com/c/titanic
Load the data
• Load the relevant libraries for the tools that we intend to use for the analysis
• Load the dataset into the platform using input commands appropriate to the dataset type;
CSV/text/Excel/image files each have their own input commands
• Conduct basic exploration of the data fields by loading the first few lines to see the type of data. Read the data dictionary, if
available, to see the definitions of the dataset variables
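As an illustrative sketch in Python/pandas (the slides themselves show R output; the file name titanic.csv and the in-memory sample below are assumptions for the demo):

```python
import io
import pandas as pd

# In practice: df = pd.read_csv("titanic.csv")  # hypothetical file name
# For a self-contained demo, read a small in-memory CSV with the same columns.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38\n'
)
df = pd.read_csv(sample_csv)

print(df.head())   # first few lines: column names and value types
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
```
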
Loading the data
Key things to check:
• Check if all observations are available or if
there are some empty/ NULL observations
• Check if there are some redundant /
duplicate variables
• Check if all variables are clearly defined in
the data dictionary
• Check format of the data
• Check the input and the output variables:
Here Survived is the output variable.
• Check which variables can be combined for
better model building: feature
re-engineering
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670
50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
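The slide shows R's str() output; the equivalent checks for NULL observations and duplicates in pandas might look like this (toy data with assumed columns, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy slice mimicking a few Titanic columns, with some missing values
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [None, "C85", None, None],
})

print(df.isnull().sum())       # empty/NULL observations per variable
print(df.dtypes)               # format of each variable
print(df.duplicated().sum())   # redundant duplicate rows
```
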
Explore the data
• Form a few hypotheses that we need to check from the data to have a better understanding of the data
Sample hypotheses:
1. Females had better probability of survival
2. Upper class passengers had higher probability of survival
3. Children below 5 years of age had a high survival probability
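Hypotheses like these can be checked with a simple group-by; a sketch in pandas on a toy sample (illustration only, not the real dataset):

```python
import pandas as pd

# Toy sample for illustration only
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# Hypothesis 1: survival rate by sex
surv_by_sex = df.groupby("Sex")["Survived"].mean()
print(surv_by_sex)
```
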
Exploring the data
[Figures: survival-rate charts for hypotheses 1-3]
Create new features
• To check some of the hypotheses, we sometimes need to combine existing features to create a new variable
Sample hypotheses:
4. Larger families had higher survival probability
5. Some names with specific Titles have higher survival probability
Feature Re-engineering: Creating new features
Hypothesis 4: Family = SibSp + Parch + 1
Larger families (size > 4) and solo travelers mostly perished
Hypothesis 5: Titles are extracted from the Name variable
Less frequent titles are grouped into 'Others'
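In pandas, the two new features might be derived as follows (a sketch; the regular expression assumes the "Surname, Title. Given names" format of the Name column):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

# Hypothesis 4: family size, counting the passenger themselves
df["Family"] = df["SibSp"] + df["Parch"] + 1

# Hypothesis 5: the title sits between the comma and the period in Name
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
```
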
Algorithms for Predictive Model Building
Classification Trees
Results: Accuracy 78.3%
Node details:
• TitleSimplified = Mr (Accuracy: 61%)
• CabinGroup = A/C/F/G/T (Accuracy: 88%) - Survived
• CabinGroup = B/D/E (Accuracy: 48%)
• Age >= 44 (Accuracy: 75%) - Survived
• Age < 44 (Accuracy: 68%) - Died
• TitleSimplified = Master/Miss/etc.
• Family >= 3.5 (Accuracy: 80%) - Survived | Family < 3.5 (Accuracy: 20%) - Died
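A classification tree like this can be fit with scikit-learn; an illustrative sketch on tiny hand-made data, not the slide's actual model:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features: [is_male, Pclass, Family]; labels: 1 = survived
X = [[1, 3, 1], [0, 1, 2], [0, 3, 5], [1, 1, 2], [0, 2, 3], [1, 3, 7]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.score(X, y))  # training accuracy; held-out data should be used in practice
```
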
What can we learn from the exploratory analysis?
• Confirm/ Reject hypotheses - The confirmed hypotheses will act as parameters to the model
• Discover new patterns – ex. ‘Title’ seems to be a good predictor for survival
• Observe correlation among parameters- need to eliminate one of the two to reduce noise in the model
• Find the missing values in some parameters – Find suitable methods to address the gaps through averages/ median etc.
• Find duplicate/ redundant parameters – These need to be eliminated from the model
Predictive Model Building (1/2)
Algorithms for Predictive Model Building
Random forests
Random forests provide an improvement over decision trees: each time a split in a tree is considered, a random sample of m
predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m
predictors. A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p
Results: Accuracy: 80%
No. of variables tried at each split = 2
OOB error estimate ≈ 16%
Predictive Model Building (2/2)
Confusion Matrix:
            Dead  Survived  Error
Dead         366        45   ~11%
Survived      65       192   ~25%
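In scikit-learn, max_features="sqrt" implements the m ≈ √p rule, and oob_score=True reports the out-of-bag estimate; an illustrative sketch on toy data, not the slide's actual model:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy features: [is_male, Pclass, Family, Fare]; labels: 1 = survived
X = [[1, 3, 1, 7.25], [0, 1, 2, 71.3], [0, 3, 5, 7.9], [1, 1, 2, 53.1],
     [0, 2, 3, 30.0], [1, 3, 7, 8.05], [0, 1, 1, 80.0], [1, 2, 2, 13.0]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)  # out-of-bag accuracy; 1 - rf.oob_score_ is the OOB error
```
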
Ensemble models for ‘Majority Vote’
Compare results of various algorithms
Algorithms Accuracy
Decision trees 78%
Random Forest 80%
Support Vector machine (linear) 75%
Majority voting mechanism to arrive at the final score and prediction:
Result: Accuracy 79%
Final Comments
• More models, such as Gradient Boosting and Logistic Regression, can be enlisted into the ensemble, and the best
combination found through iteration
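The majority-vote mechanism itself is simple; a sketch with Python's standard library:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

# Per-passenger votes from (decision tree, random forest, linear SVM)
print(majority_vote(["Survived", "Survived", "Died"]))  # Survived
```
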
Coursera: Data Science Specialization
https://www.coursera.org/specializations/jhu-data-science
Online pdfs:
https://leanpub.com/rprogramming
https://leanpub.com/LittleInferenceBook
https://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
Online datasets:
https://www.kaggle.com/datasets
https://www.dataquest.io/blog/free-datasets-for-projects/
Data Scientist resources