Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

What’s in it for you?
What is Data Science? Who is a Data Scientist?
What are the skills of a Data Scientist?
What does a Data Scientist do?
Data Acquisition
Data Preparation
Data Mining with Tableau
Model Building & Testing
Model Maintenance
Summary

What is Data Science?
Data Science is the area of study which involves extracting knowledge
from all the data you can gather

Hence, translating a business problem into a research project, and then translating the results
back into a practical solution requires a deep understanding of the business domain and
creativity

Fraudulent transactions are ramping on the internet and the results can be devastating

Data Science can help in fraud detection using advanced machine learning algorithms and prevent great
monetary losses

And people doing all this work are called Data Scientists

Model Maintenance
Data Acquisition
Data Preparation
Data Mining
Data Modelling
Model Maintenance

Data Acquisition
Data Acquisition involves
extracting data from multiple
source systems
Integrating and transforming the
data into homogeneous format
Loading the transformed data
into data warehouse
Multiple source
systems (Database,
FlatFiles etc)

Data Acquisition
source systems
into data warehouse
Transformation

Data Acquisition
source systems
into data warehouse
Data Warehouse

Data Acquisition
Data warehouse
Multiple source
systems (Database,
FlatFiles etc.)
Transformation

Data Acquisition
Data warehouse
Multiple source
systems (Database,
FlatFiles etc)
Transformation
This process is mainly referred
as ETL (Extract, Transform,
Load)

Data Acquisition
Various tools are available for ETL process like Talend Studio,
DataStage, Informatica

Model Maintenance
Data Acquisition
Data Preparation
Data Mining
Model Building
Model Maintenance

Data Preparation
The most essential part of any Data Science project is Data Preparation. It consumes 60% of
the time spent on the project.
There are many things we can do to ensure
data is used in the most productive and
meaningful manner

Data Cleaning
Data Transformation
Handling Outliers
Data Integration
Data Reduction
Data Preparation
DATA CLEANING:
• Data cleaning is important because bad data may lead to a bad
model
• It handles missing values which may be due to various reasons
• In this, NULL or unwanted values is also handled
• It improves business decisions and increases productivity

Data Cleaning
Data Transformation
Handling Outliers
Data Integration
Data Reduction
Data Preparation
DATA TRANSFORMATION:
• Data Transformation turns raw data formats into desired outputs
• It involves normalization of data
• Min-max normalization
• Z-score normalization

Data Cleaning
Data Transformation
Handling Outliers
Data Integration
Data Reduction
Data Preparation
HANDLING OUTLIERS:
• Outliers are observations which are distant from the rest of the
data.
• They can be good or bad for the data
• Univariate and Multivariate methods are used for detection
• Outliers can be used for fraud detection
• Plots used: Scatter plots, box plots

Data Cleaning
Data Transformation
Handling Outliers
Data Integration
Data Reduction
Data Preparation
DATA INTEGRITY:
• Data integrity means that data is accurate and reliable
• It is important where business decisions are made on the basis of
data regularly

Data Cleaning
Data Transformation
Handling Outliers
Data Integration
Data Reduction
Data Preparation
DATA REDUCTION:
• This step reduces multitudinous amounts of data into meaningful
parts
• It increases storage capacity
• It is a factor helping in reducing costs
• Involves data deduplication which eliminates redundant data

Data Preparation – Data Cleaning
Data cleaning is one of the most common tasks of data preparation which involves ensuring that the data is:
Valid
Consistent
Uniform
Accurate

Let’s have a look at a bank’s data with it’s customer details

We can import and read the data using pandas library of Python

Let’s look at the Geography column which has some missing
values
In this case, we might not want to assume the location of the
customers

So, let’s fill the missing values with empty string

In numerical columns, filling the unwanted values with 'mean' can help us to even out the data set:
Before: After:
Data.CreditScore =
data.CreditScore.fillna(data.creditScore.mean())

We can remove all rows with unwanted values, however this is a very aggressive step
Dropping all rows with
any NA values is easy:
data.dropna()

data.dropna()
We can also drop rows that
have all NA values:
data.dropna(how=’all’)

data.dropna()
We can also drop rows that
have all NA values:
data.dropna(how=’all’)
We can also put a
limitation example, the
data needs to have at least
5 non-null values
data.dropna(thresh=5)

Model Maintenance
Data Mining
Data Acquisition
Data Preparation
Data Mining
Model Building
Model Maintenance

Data Mining
After acquiring the data, preparing and
cleaning it, Data Scientist discovers
patterns and relationship in data to make
better business decisions

Data Mining
It’s a discovery process, to get
hidden and useful knowledge.
It is commonly referred as ‘Data
Mining’

Data Mining
Let me show you how to do data
mining using Tableau.

Data Mining using Tableau
You can easily download Tableau software from
https://www.tableau.com/

You can connect your csv file to Tableau desktop to start
mining the data!
You can connect to various other sources also

Tableau will automatically identify the fields as Dimensions
(descriptive) and Measures (numeric fields)

Problem Statement
It is important to understand the
problem statement first!
In this problem, we have a dataset of a bank’s customers.
We want to understand the exit behavior of the customers based
on different variables, for instance:
• Gender
• Credit card holding
• Geography

Now, we have a dataset of about 10,000 rows as shown below and we need to find out potential
customers who will exit first

Let’s first evaluate on the basis of gender, to understand if this variable will affect the model

We have moved the variable ‘exited’ from Measures to Dimensions

Here, Tableau recognizes ‘exited’ column as measure because of its
numerical datatype. However, ‘exited’ is categorical for us as it tells
whether a person has left or not.
So, we have placed ‘exited’ in dimensions as seen in the screenshot:

If we put ‘gender’ in columns and ‘exited’ on colors
Where:
0 – people who did not exit
1 – people who did exit

For better understanding, let’s
prepare a bar chart which
depicts male/female who
stayed/exited

Stayed
Exited
You can give alias describing that 0
means stayed and 1 means exited

To compare the exit rates, you can also add a
reference line by right clicking on the right
axis and selecting ‘Add Reference Line’

So, we have added the reference line and we can say
that female customers exit more than average as
compared to male customers!

We can also see on the basis of whether or not they have a credit card. Again if a person has credit
card or not is a categorical decision so we put it into dimensions and then evaluate:
As explained earlier, we have added the
alias and a reference line.

Hence, a customer leaving the bank
is not dependent on a person holding
credit card
Here, we can see that we cannot differentiate the exit behavior between people having a credit and those who do not

Here, we can see that customers leaving the
bank in Germany is more than that in France or
Spain. So, we treat this as an anomaly and it
can impact the model
Similarly, we can evaluate on the basis of geography:

So, we can ignore the variable ‘has
credit card’ because it will not impact the
model

And focus on other variables like
‘gender’, ‘geography’ which can
actually have an impact

So, we can make business
decisions based on these findings.
Lets see some advantages of data
mining

Advantages of Data Mining
Predicts future trends
Signifies customer patterns
Helps decision making
Quick fraud detection
Choosing right algorithm

Model Maintenance
Model Building
Data Acquisition
Data Preparation
Data Mining
Model Building
Model Maintenance

What is Model Building?
The model is built by selecting a
machine learning algorithm that suits
the data, use case, and available
resources

Many choose to build more than
one model and go ahead only with
the best fit

There are many machine learning
algorithms, chosen based on the
data and the problem at hand

Machine Learning Algorithms used by Data ScientistsCategorialContinuous
Supervised Unsupervised
• Regression
Linear
Multiple
• Decision Trees
• Random Forest
• Classification
KNN
Logistic Regression
SVM
Naïve-Baiyes
• Clustering
K-Means
PCA
• Association Analysis
• Hidden Markov Model

Let me take you through a use
case for better understanding

Predicting World happiness
Model Building - Multiple Linear Regression
Questions To Ask:
1) How to describe the data?
2) Can we make a predictive model to calculate happiness score?

Variables used in the model
Happiness Rank
Region
Health
Happiness Score
Economy
Freedom
Country
Family
Trust
Generosity Dystopia Residual

Variables Used in Data:
Happiness Rank: determined by a country’s happiness score
Happiness Score: A score given to a country based on adding up the rankings that a population has given to each category (normalized)
Country: The country in question
Region: The region that the country belongs too (different than continent)
Economy: GDP per capita of the country
Family: quality of family life, nuclear and joint family
Health: ranking healthcare availability and average life expectancy in the country
Freedom: how much an individual is able to conduct them self based on their free will
Trust: that the government to not be corrupt
Generosity: how much their country is involved in peacekeeping and global aid
Dystopia Residual: Dystopia happiness score (1.85) i.e. the score of a hypothetical country that has a lower rank than the lowest ranking country
on the report, plus the residual value of each country (a number that is left over from the normalization of the variables which cannot be
explained).
For instructor

Importing the Python Libraries:

Preparing and Describing the Data:
• Read csv file using pandas “read_csv” and put it in a dataframe ‘happiness_2015’
• Similarly for 2016 and 2017, store the data in happiness_2016 and happiness_2017 respectively

Model Building - Multiple Linear RegressionHappiness_2016

Happiness_2017

We can concatenate the three
data or we can build a model
separately for one csv
Head() shows top countries with
highest happiness_score

We can create a visual which gives us a more appealing view of where each country is placed in the World ranking
report

The lighter colored countries
have a lower ranking.
The darker colored countries have
higher ranking on the report (i.e. are
the “happiest)

We can find out the correlation between Happiness Score and Happiness Rank by plotting a scatterplot

And the code is right
here!

• Happiness score determines how the country is ranked,
so happiness score as predictor and the happiness rank
as the dependent variable.
• The higher the score the lower the numerical rank, and
higher the happiness rating
• Therefore, happiness score and rank are negatively
correlated(as score increases, rank decreases)
Let’s dissect the graph:

However, we see that rank is directly
dependent on Happiness Score. So, we can
drop rank from our data frame

We can draw a heat map and
see the correlation between the
variables

We can see that happiness score is
strongly correlated with economy and
health, followed by family and
freedom.

Thus, in our model, we should see strong correlation between variables when finding the
coefficients.

Now, we can start to use SkLearn to construct the model
We drop categorical values and happiness rank because we will not explore it in this report

Let’s divide the dataset into train and test for further model building and testing:
So, data is split in 80:20 ratio between and train and test dataset respectively

Then, we import sklearn’s linear regression to fit the regression to the training set

Evaluate the predicted values and calculate the difference:

We can print the intercept and coefficients for the train dataset:

Trick: Right after you have trained your model, you know the order of the coefficients
This will print the coefficients
and the corresponding
feature
Tip: you can also use dictionary to reuse the coefficients later

For testing the model, let’s find out mean error:

Using sklearn.predict, we can use this model to predict the happiness scores for the first 100 countries in our model
How do these predictions compare
to the actual values in our data?

Let’s create a plot of our actual
happiness score versus the
predicted happiness score.

Hence, we can say that our model is a pretty good indicator of the
actual happiness score as there is a strong positive correlation
between the actual and the predicted values

HappinessScore=(0.0001289) +
(1.000048 * Economy) +
(0.999997 * Family)+
(0.999865 * Health) +
(0.999920 * Freedom) +
(0.999993 * Trust) +
(1.000029 * Generosity) +
(0.999971 * DystopiaResidual)
The Multiple Linear Regression Model for Happiness Score:

Before you present your data model and insights, first understand:
What is your goal?Who is your target audience?
Communicate Results
But the battle is not over yet!!

Communicate Results
Before you present your data model and insights, first understand:
What is your goal?Who is your target audience?

Communicate Results
After you are clear with your objective, think about:
• What’s the question?
• What’s your
methodology?
• What’s the answer?

Communicate Results
Then, you’re finally ready to communicate your results to the
business teams so that it easily goes into execution phase

Model Maintenance
Model Maintenance
Data Acquisition
Data Preparation
Data Mining
Model Maintenance
Model Building

Model Maintenance
After we have deployed
the model, it is also
important to prevent the
model from deteriorating.
How can we do that?

ASSESS: Once in a while, we have to run a fresh
sample through data to assess our model
RETRAIN: After assessment, if you are not happy
with the performance of the model then you
retrain the model
REBUILD: If retrain fails too, then you have to rebuild by
finding all the variables again and build a new model
Model Maintenance

Summary
What is data science? What a data scientist does? Data acquisition
Model maintenanceMODEL BuildingData Mining using tableau

Model Building Using ARIMA
Plot the data:
As we can observe that mean and variance are
changing with time. Hence, this is non-stationary
series

Auto Correlation Function(ACF):

Partial Auto Correlation Function(PACF):

Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

Similar a Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn (20)

Más de Simplilearn

Más de Simplilearn (20)

Último

Último (20)

Data Science Training | Data Science For Beginners | Data Science With Python Course | Simplilearn

Notas del editor