This Edureka Data Science tutorial will help you understand the ins and outs of Data Science with examples. It is ideal for both beginners and professionals who want to learn or brush up on their Data Science concepts. Below are the topics covered in this tutorial:
1. Why Data Science?
2. What is Data Science?
3. Who is a Data Scientist?
4. How is a Problem Solved in Data Science?
5. Data Science Components
Edureka’s Data Science Certification Training: www.edureka.co/data-science
Agenda for Today’s Session
Why Data Science?
What is Data Science?
Who is a Data Scientist?
How is a Problem Solved in Data Science?
Data Science Components
Demo
Why Data Science?
The most abundant thing today is data. We have data about everything, and it is multiplying every day!
[Figure: Increase in data]
What is Data Science?
Also called data-driven science, it is an interdisciplinary field of scientific methods, processes, and systems for extracting knowledge or insights from data in various forms, either structured or unstructured.
A question that data scientists are usually asked is:
“Tell us something that we don’t know.”
It involves:
Programming + Statistics + Business
Who is a Data Scientist?
A DATA SCIENTIST sits at the intersection of three areas:
MATHS: Statistics, Discrete Maths, Information Theory, Combinatorics, Decision Theory
BUSINESS: Economics, Finance, Marketing, Operations, Management
INFORMATION SYSTEMS: Computer Science, Software Engineering, Systems Development
Roles in the overlaps between these areas: Machine Learning, Data Viz Builders, Statistical Programmers, Econometricians, Management Scientists, Actuaries, BI Developers, Data Analysis
Role Of A Data Scientist
The Data Scientist will be responsible for designing and creating processes and layouts for complex,
large-scale data sets used for modeling, data mining, and research purposes.
Responsibilities
➢ Selecting features, building and optimizing classifiers using machine learning techniques.
➢ Data mining using state-of-the-art methods.
➢ Extending the company’s data with third-party sources of information when needed.
➢ Processing, cleansing, and verifying the integrity of data for analysis.
➢ Building predictive models using Machine Learning algorithms.
Problem Solving in Data Science
Discovery
Data Preparation
Model Planning
Model Building
Operationalize
Communicate Results
➢ Discovery involves acquiring data from all the identified internal and external
sources that can help answer the business question.
➢ This data could be
• logs from webservers
• social media data
• census datasets
• data streamed from online sources via APIs
The doctor gets this data from the patient’s medical history.
Attributes:
npreg – Number of times pregnant
glucose – Plasma glucose concentration
bp – Blood pressure
skin – Triceps skinfold thickness
bmi – Body mass index
ped – Diabetes pedigree function
age – Age
income – Income
Income is an irrelevant attribute in the
prediction of diabetes
Data Preparation
➢ The data can have many inconsistencies, such as missing values, blank columns, abrupt values, and incorrect data formats, which need to be cleaned.
➢ You need to explore, preprocess, and condition the data before modeling.
➢ This helps you spot outliers and establish relationships between the variables.
This data has a lot of anomalies and needs cleansing before further analysis can be done.
We clean and preprocess this data by removing the outliers, filling in the null values, and normalizing the data types.
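As a rough illustration of these three cleaning steps, the sketch below works on a handful of made-up patient records in plain Python (the tutorial itself uses R). The records, the `clean` and `to_float` helpers, and the assumed plausible glucose range of 40 to 400 are all hypothetical choices for the example, not part of the original dataset.

```python
from statistics import median

# Hypothetical raw records: values arrive as strings, some missing or implausible.
raw = [
    {"glucose": "148", "bmi": "33.6"},
    {"glucose": "85", "bmi": ""},        # missing bmi
    {"glucose": "9999", "bmi": "23.3"},  # implausible glucose (outlier)
    {"glucose": "137", "bmi": "43.1"},
]

def to_float(text):
    """Normalize the data type: numeric strings become floats, blanks become None."""
    return float(text) if text.strip() else None

def clean(records, glucose_range=(40.0, 400.0)):
    # 1. Normalize the data types.
    rows = [{k: to_float(v) for k, v in r.items()} for r in records]
    # 2. Remove outliers (the plausible range is an assumption for this sketch).
    low, high = glucose_range
    rows = [r for r in rows if r["glucose"] is not None and low <= r["glucose"] <= high]
    # 3. Fill null bmi values with the median of the known ones.
    known = [r["bmi"] for r in rows if r["bmi"] is not None]
    fill = median(known)
    for r in rows:
        if r["bmi"] is None:
            r["bmi"] = fill
    return rows

cleaned = clean(raw)
```

Filling with the median is only one possible choice; dropping incomplete rows or filling with the mean would follow the same pattern.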
Model Planning
➢ Here, we determine the methods and techniques for drawing out the relationships between variables.
➢ Apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.
Use visualization techniques such as histograms, line graphs, and box plots to get a fair idea of the distribution of the data.
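A histogram is just a count of values per bin, which the sketch below computes and prints as text bars. It is written in Python for illustration (the tutorial's own tool is R); the `histogram` helper and the list of ages are made up for the example.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages, used only to illustrate the binning.
ages = [21, 22, 25, 29, 31, 33, 34, 41, 45, 50, 52, 63]
age_hist = histogram(ages, bin_width=10)

# Print one text bar per bin; each bin spans 10 years here.
for start, count in age_hist.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```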
Model Building
➢ Develop datasets for training and testing purposes.
➢ Consider whether existing tools will suffice for running the models.
➢ Analyze various learning techniques like classification, association and clustering
to build the model.
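Developing datasets for training and testing usually means shuffling the records and holding some of them back. The Python sketch below shows one simple way to do that; the `train_test_split` helper, the 25% test fraction, and the fixed seed are assumptions for illustration, not the procedure used in the original demo.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 patient records
train, test = train_test_split(rows)
```

The model is then fitted on `train` only and evaluated on `test`, so the evaluation uses records the model has not seen.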
This is a decision tree based on different attributes.
Operationalize
➢ Deliver final reports, briefings, code, and technical documents.
➢ Implement a pilot project in a real-time production environment.
➢ Look for performance constraints, if any.
Communicate Results
➢ Identify all the key findings and communicate them to the stakeholders.
➢ Explain the model and results to the medical authorities.
➢ Determine whether the results of the project are a success or a failure based on the criteria developed.
➢ Diabetes Positive set:
• glucose > 154
• 127 < glucose <= 154 and bmi > 30.9
• glucose <= 127 and pregnant > 5
• glucose <= 127 and pregnant <= 5 and age > 28
• glucose <= 127 and pregnant <= 5 and age <= 28 and bmi > 30.9
➢ Diabetes Negative set:
• 127 < glucose <= 154 and bmi <= 30.9
• glucose <= 127 and pregnant <= 5 and age <= 28 and bmi <= 30.9
➢ We can use this decision tree result to tell whether a patient is vulnerable to diabetes.
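The rule sets above can be read as nested conditions. The Python sketch below encodes them as a single function, treating glucose > 154 as positive, as in the positive set. The function name `predict_diabetes` and the sample patients are hypothetical; the thresholds are the ones listed on the slide.

```python
def predict_diabetes(glucose, bmi, pregnant, age):
    """Return True for the diabetes-positive rule set, False for the negative one."""
    if glucose > 154:
        return True
    if glucose > 127:  # 127 < glucose <= 154
        return bmi > 30.9
    # From here on, glucose <= 127.
    if pregnant > 5:
        return True
    if age > 28:
        return True
    return bmi > 30.9

# Hypothetical patients, for illustration only.
print(predict_diabetes(glucose=160, bmi=25.0, pregnant=1, age=30))  # True
print(predict_diabetes(glucose=140, bmi=28.0, pregnant=2, age=45))  # False
print(predict_diabetes(glucose=110, bmi=29.0, pregnant=1, age=25))  # False
```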
Problem Solving in Data Science
We take a top-down approach: start from the question to be answered and pick the matching family of algorithms.
These are the 5 questions that can be answered in data science:
Q1. Is this A or B? Classification Algorithm
Q2. Is this weird? Anomaly Detection Algorithm
Q3. How much or how many? Regression Algorithms
Q4. How is this organized? Clustering Algorithms
Q5. What should I do next? Reinforcement Learning
These algorithms fall into three categories:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
Teaching by Example
Let’s take an example. Say you are a teacher, and your way of teaching is to teach by example, i.e. for every problem your kids face, you provide them with the solution. This type of learning is called supervised learning.
Supervised learning is a type of machine learning algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can make predictions of the response values for a new dataset.
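To make the definition concrete, here is a minimal Python sketch of supervised learning using a 1-nearest-neighbour rule, chosen only because it is short; the tutorial does not prescribe it. The training pairs of (glucose, bmi) with their labels are made-up numbers.

```python
import math

# Training dataset: (input data, response value). The numbers are made up.
training = [
    ((90.0, 22.0), "negative"),
    ((100.0, 25.0), "negative"),
    ((160.0, 35.0), "positive"),
    ((170.0, 33.0), "positive"),
]

def predict(features):
    """1-nearest-neighbour: answer with the response of the closest training example."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], features))
    return label

print(predict((165.0, 34.0)))  # positive
print(predict((95.0, 24.0)))   # negative
```

The "teacher" here is the set of labeled examples: every prediction is copied from a problem whose answer was already provided.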
Unsupervised Learning
Self Learning
When your kids make decisions based on their own understanding, this type of learning is unsupervised learning.
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
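As one possible illustration of learning without labeled responses, the Python sketch below groups unlabeled numbers with a very small one-dimensional k-means. The `kmeans_1d` helper, the readings, and the starting centers are all assumptions made for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Unlabeled, made-up glucose readings; the starting centers are also an assumption.
readings = [88, 92, 95, 150, 155, 160]
centers, groups = kmeans_1d(readings, centers=[80, 170])
```

No reading carries a label, yet the algorithm still separates the low values from the high ones on its own.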
Reinforcement Learning
Good or Bad?
If a new situation comes up, the kid will act on his own, drawing on his past experiences, but as a parent you can tell him at the end of the action whether he did well or not.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
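The act-then-get-feedback loop can be sketched in a few lines of Python. This is a deliberately tiny, deterministic example with a single situation and three possible actions (a so-called bandit setting), not a full reinforcement learning algorithm. The reward values, the `run_agent` helper, and the optimistic starting estimate are assumptions for illustration.

```python
# Reward the "parent" gives for each action (made-up numbers).
REWARDS = {"a": 0.2, "b": 1.0, "c": 0.5}

def run_agent(steps=10, optimistic_start=5.0):
    """Greedy agent with optimistic initial estimates and incremental-mean updates."""
    estimates = {action: optimistic_start for action in REWARDS}
    counts = {action: 0 for action in REWARDS}
    total_reward = 0.0
    history = []
    for _ in range(steps):
        # Act: pick the action that currently looks best (ties go to the first one).
        action = max(estimates, key=estimates.get)
        # Feedback arrives only after the action has been taken.
        reward = REWARDS[action]
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
        history.append(action)
    return history, total_reward

history, total_reward = run_agent()
```

Because every untried action starts with an optimistic estimate, the agent tries each action once and then keeps repeating the one that earned the most reward.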
Data Science Tools
A tool that is widely used by data analysts is R.
R is an open source programming language and software environment for statistical computing and graphics that is supported by
the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing
statistical software and data analysis.
Why R?
Programming and Statistical Language
Apart from being used as a statistical language, it can also be used as a programming language for analytical purposes.
Data Analysis and Visualization
Apart from being one of the most dominant analytics tools, R is also one of the most popular tools for data visualization.
Simple and Easy to Learn
R is simple and easy to learn, read, and write.
Free and Open Source
R is an example of FLOSS (Free/Libre and Open Source Software), which means one can freely distribute copies of this software, read its source code, modify it, etc.
Datasets
To do any kind of analysis, you need data. This need is fulfilled through datasets.
What are datasets?
A dataset is a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer.
[Figure: Sample Dataset]
EDUREKA HADOOP CERTIFICATION TRAINING: www.edureka.co/big-data-and-hadoop
What is Big Data?
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications”
Volume: Processing increasingly huge data sets
Variety: Processing different types of data
Velocity: Data is being generated at an alarming rate
Value: Finding correct meaning out of the data
Veracity: Uncertainty and inconsistencies in the data
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
Storage: Distributed File System
Processing: Allows parallel & distributed processing
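The general idea of splitting data, processing the pieces side by side, and combining the partial results can be sketched on a single machine. The Python example below is only an analogy: it does not use Hadoop or its API, and threads stand in for cluster nodes. The `count_chunk` and `parallel_count` helpers and the "year,weather" records are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(lines):
    """'Map' step: count the weather field in one chunk of records."""
    return Counter(line.split(",")[1] for line in lines)

def parallel_count(lines, n_chunks=2):
    # Split the data into chunks, much as a distributed file system splits files into blocks.
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Process the chunks concurrently, then combine ('reduce') the partial counts.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(count_chunk, chunks))
    return sum(partials, Counter())

# Hypothetical "year,weather" records.
records = ["2012,rain", "2012,fine", "2013,rain", "2013,fog", "2013,rain"]
totals = parallel_count(records)
```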
Now you need a data analytics tool that can handle this much data and processing.
For that, we use SparkR.
What is SparkR?
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.1.1, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation, etc. (similar to R data frames and dplyr), but on large datasets.
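The three operations named above are easy to picture on a small in-memory table. The Python sketch below shows them on a plain list of records. It is not SparkR and does not use any Spark API; the accident records and field names are made up for illustration.

```python
from statistics import mean

# Hypothetical accident records.
accidents = [
    {"year": 2012, "weather": "rain", "casualties": 2},
    {"year": 2012, "weather": "fine", "casualties": 1},
    {"year": 2013, "weather": "rain", "casualties": 3},
    {"year": 2013, "weather": "rain", "casualties": 1},
]

# Selection: keep only some columns.
selected = [{"weather": r["weather"], "casualties": r["casualties"]} for r in accidents]

# Filtering: keep only some rows.
rainy = [r for r in selected if r["weather"] == "rain"]

# Aggregation: reduce many rows to one summary value.
mean_rain_casualties = mean(r["casualties"] for r in rainy)
```

A distributed data frame offers the same kinds of operations, with the work spread across a cluster instead of a single list in memory.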
Demo
This dataset provides detailed road safety data about the circumstances of personal injury road accidents from 1979 to 2013. Our aim is to find the number of accidents that happened:
✓ In various weather conditions
✓ In various light conditions
✓ In various road surface conditions
✓ By make of the vehicles involved
✓ On various days of the week
✓ On various road types
✓ At various speed limits
We also want the number of casualties per accident per year.
We have to find the results of these queries in Hadoop.
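Each of these queries is a grouped count or a grouped average. The Python sketch below shows the shape of the computation on a few made-up records. The field names are assumptions rather than the real schema of the road safety dataset, and the actual demo runs in R on data stored in HDFS.

```python
from collections import Counter, defaultdict

# Hypothetical accident records; the field names are assumptions, not the real schema.
accidents = [
    {"year": 2012, "weather": "rain", "speed_limit": 30, "casualties": 2},
    {"year": 2012, "weather": "fine", "speed_limit": 60, "casualties": 1},
    {"year": 2013, "weather": "rain", "speed_limit": 30, "casualties": 3},
    {"year": 2013, "weather": "fog", "speed_limit": 30, "casualties": 1},
]

def accidents_by(records, field):
    """Number of accidents for each distinct value of one field."""
    return Counter(r[field] for r in records)

def casualties_per_accident_per_year(records):
    """Average number of casualties per accident, computed separately for each year."""
    totals = defaultdict(lambda: [0, 0])  # year -> [casualties, accidents]
    for r in records:
        totals[r["year"]][0] += r["casualties"]
        totals[r["year"]][1] += 1
    return {year: c / n for year, (c, n) in totals.items()}

by_weather = accidents_by(accidents, "weather")
by_speed = accidents_by(accidents, "speed_limit")
per_year = casualties_per_accident_per_year(accidents)
```

The same `accidents_by` helper would serve for light conditions, road surface, day of week, or road type by passing a different field name.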
1. Huge amount of accident data
2. Data stored in HDFS
3. Using R for analysis
Analyze the following queries for accidents:
✓ in various weather conditions
✓ in various light conditions
✓ in various road surface conditions
✓ with make information of the accident vehicles
Session In A Minute
Why Data Science?
What is Data Science?
Who is a Data Scientist?
How is a problem solved in Data Science?
Data Science Components
Demo