Edureka’s Data Science Certification Training (www.edureka.co/data-science)
Agenda for Today’s Session
Why Data Science?
What is Data Science?
Who is a Data Scientist?
How is a Problem Solved in Data Science?
Data Science Components
Demo
Why Data Science?
Data is the most abundant resource today. We have data about almost everything, and its volume multiplies every day.
Figure: increase in data over time.
What is Data Science?
Data science, also called data-driven science, is an interdisciplinary field that uses scientific methods, processes and systems to extract
knowledge or insights from data in various forms, either structured or unstructured.
A question often put to data scientists is:
“Tell us something we don’t know.”
It involves:
Programming + Statistics + Business
Who is a Data Scientist?
A DATA SCIENTIST sits at the centre of three overlapping areas:
MATHS: Statistics, Discrete Maths, Information Theory, Combinatorics, Decision Theory
BUSINESS: Economics, Finance, Marketing, Operations, Management
INFORMATION SYSTEMS: Computer Science, Software Engineering, Systems Development
Roles shown where the areas overlap: Machine Learning, Data Viz Builders, Statistical Programmers, Econometricians, Management Scientists, Actuaries, BI Developers, Data Analysis
Role Of A Data Scientist
The Data Scientist will be responsible for designing and creating processes and layouts for complex,
large-scale data sets used for modeling, data mining, and research purposes.
Responsibilities
➢ Selecting features, building and optimizing classifiers using machine learning techniques.
➢ Data mining using state-of-the-art methods.
➢ Extending the company’s data with third-party sources of information when needed.
➢ Processing, cleansing, and verifying the integrity of data for analysis.
➢ Building predictive models using Machine Learning algorithms.
How is a problem solved in Data Science?
Problem Solving in Data Science
Discovery
Data Preparation
Model Planning
Model Building
Operationalize
Communicate Results
➢ Discovery involves acquiring data from all the identified internal and external
sources that can help answer the business question.
➢ This data could be
• logs from webservers
• social media data
• census datasets
• data streamed from online sources via APIs
For example, a doctor gets this data from the medical history of the patient.
Attributes:
npreg – Number of times pregnant
glucose – Plasma glucose concentration
bp – Blood pressure
skin – Triceps skinfold thickness
bmi – Body mass index
ped – Diabetes pedigree function
age – Age
income – Income
Income is an irrelevant attribute in the
prediction of diabetes
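Dropping an irrelevant attribute can be sketched in a few lines of Python. This is a minimal illustration only: the attribute names come from the slide, while the patient values and the helper name `select_features` are made up for the example.

```python
# Hypothetical patient record; keys follow the attribute list above,
# values are invented for illustration.
patient = {
    "npreg": 2, "glucose": 130, "bp": 70, "skin": 25,
    "bmi": 31.2, "ped": 0.4, "age": 35, "income": 52000,
}

# Attributes judged irrelevant for predicting diabetes.
IRRELEVANT = {"income"}

def select_features(record, irrelevant=IRRELEVANT):
    """Return a copy of the record without the irrelevant attributes."""
    return {name: value for name, value in record.items() if name not in irrelevant}

features = select_features(patient)
print(features)
```

Deciding which attributes are irrelevant is a judgment made during discovery; the code only applies that decision.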
➢ The data can contain many inconsistencies, such as missing values, blank columns,
abrupt values and incorrect data formats, which need to be cleaned.
➢ You need to explore, preprocess and condition the data prior to modeling.
➢ This will help you spot the outliers and establish relationships between the
variables.
This data has a lot of anomalies and needs cleansing before further analysis
can be done.
We clean and preprocess this data by removing the outliers, filling in the
null values and normalizing the data types.
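The three cleaning steps can be sketched as follows. This is one possible approach under stated assumptions, not the method used in the session: the function name `clean_column`, the sample values and the valid range are hypothetical, out-of-range values are treated as missing instead of dropping whole rows, and missing values are filled with the median.

```python
from statistics import median

def clean_column(raw_values, lower, upper):
    """Normalize types, blank out outliers, then fill missing values."""
    # Step 1: normalize the data type (strings become floats, blanks become None).
    parsed = []
    for value in raw_values:
        if value is None or (isinstance(value, str) and value.strip() == ""):
            parsed.append(None)
        else:
            parsed.append(float(value))

    # Step 2: treat values outside [lower, upper] as missing (outlier removal).
    in_range = [v if v is not None and lower <= v <= upper else None for v in parsed]

    # Step 3: fill the missing values with the median of the remaining ones.
    known = [v for v in in_range if v is not None]
    if not known:
        raise ValueError("no valid values left to compute a fill value")
    fill = median(known)
    return [fill if v is None else v for v in in_range]

# Hypothetical glucose readings with mixed types, blanks and one abrupt value.
glucose = ["148", 85, None, "", 999, "137"]
cleaned = clean_column(glucose, lower=40, upper=400)
print(cleaned)
```

The median is used because it is less sensitive to any extreme values that remain than the mean would be.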
➢ Here, we determine the methods and techniques to draw the relationships
between variables.
➢ Apply Exploratory Data Analysis (EDA) using various statistical formulas and
visualization tools.
Use visualization techniques such as histograms, line graphs and box plots to get a fair idea
of the distribution of the data.
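A rough idea of a distribution can be had even without a plotting library. The sketch below computes the five numbers a box plot draws and prints a text histogram; the age values and the function names are made up for illustration.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(values):
    """Minimum, quartiles and maximum: the values a box plot displays."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}

def text_histogram(values, bin_width):
    """Count values per bin and render one line of '#' marks per bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [f"{start}-{start + bin_width - 1}: {'#' * bins[start]}" for start in sorted(bins)]

# Hypothetical ages, invented for the example.
ages = [21, 22, 25, 28, 29, 31, 33, 41, 45, 52, 60]

summary = five_number_summary(ages)
histogram = text_histogram(ages, bin_width=10)
print(summary)
print("\n".join(histogram))
```

In practice a plotting tool would draw the same information graphically; the numbers behind the picture are the same.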
➢ Develop datasets for training and testing purposes.
➢ Consider whether existing tools will suffice for running the models.
➢ Analyze various learning techniques like classification, association and clustering
to build the model.
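Developing training and testing datasets usually means shuffling the records and holding a fraction back. A minimal sketch, assuming a 25% test fraction and a fixed seed (both arbitrary choices, as is the function name):

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for 20 patient records.
rows = list(range(20))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

The model is built on the training rows only; the testing rows are kept aside to check how well it predicts data it has not seen.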
This is a decision tree based on different attributes.
➢ Deliver final reports, briefings, code and technical documents.
➢ Implement a pilot project in a real-time production environment.
➢ Look for performance constraints, if any.
➢ Identify all the key findings and communicate to the stakeholders.
➢ Explain the model and results to the medical authorities.
➢ Determine if the results of the project are a success or a failure based on the
criteria developed.
➢ Diabetes Positive set:
• glucose > 154
• glucose > 127 and <= 154, and bmi > 30.9
• glucose <= 127 and pregnant > 5
• glucose <= 127, pregnant <= 5 and age > 28
• glucose <= 127, pregnant <= 5, age <= 28 and bmi > 30.9
➢ Diabetes Negative set:
• glucose > 127 and <= 154, and bmi <= 30.9
• glucose <= 127, pregnant <= 5, age <= 28 and bmi <= 30.9
➢ We can use this decision tree result to know whether the patient is vulnerable
to diabetes or not.
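The rules read off the tree translate directly into nested conditions. The sketch below is an illustration, not the model from the session: it assumes that glucose > 154 is a positive leaf, as listed in the positive set, and the function name and argument names are chosen for the example.

```python
def predict_diabetes(glucose, pregnant, age, bmi):
    """Walk the decision tree rules from the slide and return the leaf label."""
    if glucose > 154:
        return "positive"  # assumption: treated as a positive leaf
    if glucose > 127:  # 127 < glucose <= 154
        return "positive" if bmi > 30.9 else "negative"
    # From here on glucose <= 127.
    if pregnant > 5:
        return "positive"
    if age > 28:
        return "positive"
    # pregnant <= 5 and age <= 28: bmi decides.
    return "positive" if bmi > 30.9 else "negative"

print(predict_diabetes(glucose=140, pregnant=1, age=30, bmi=25.0))
```

Each path from the first test to a `return` corresponds to one bullet in the positive or negative set.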
How to choose Algorithms in Data Science?
We take a top-down approach to answer this. These are the five questions that data science can answer:
Q1. Is this A or B? (Classification Algorithms)
Q2. Is this weird? (Anomaly Detection Algorithms)
Q3. How much or how many? (Regression Algorithms)
Q4. How is this organized? (Clustering Algorithms)
Q5. What should I do next? (Reinforcement Learning)
These algorithms fall into three categories:
Categories of Algorithms
Types of Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning
Supervised Learning (Teaching by Example)
Let’s take an example. Say you are a teacher, and your way of teaching is to teach by example:
for every problem your kids face in life, you provide them with the solution.
This type of learning is called supervised learning.
To put the same example more formally:
Supervised learning is a type of machine learning algorithm that uses a known dataset
(called the training dataset) to make predictions. The training dataset includes input data
and response values. From it, the supervised learning algorithm seeks to build a model that
can make predictions of the response values for a new dataset.
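A very small supervised learner is the nearest-neighbour rule: predict the response of the most similar training example. The sketch below is illustrative only; the (glucose, bmi) inputs, the labels and the function name are invented, and nearest neighbour is just one of many supervised algorithms.

```python
import math

def nearest_neighbour_predict(training_inputs, training_labels, new_input):
    """Return the label of the training example closest to new_input."""
    distances = [math.dist(example, new_input) for example in training_inputs]
    closest = distances.index(min(distances))
    return training_labels[closest]

# Hypothetical training dataset: input data (glucose, bmi) and response values.
training_inputs = [(148, 33.6), (85, 26.6), (183, 23.3), (89, 28.1)]
training_labels = ["positive", "negative", "positive", "negative"]

prediction = nearest_neighbour_predict(training_inputs, training_labels, (90, 27.0))
print(prediction)
```

The labelled examples play the role of the solutions the teacher provides; the model only generalizes from them to a new input.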
Unsupervised Learning (Self Learning)
When your kids make decisions based on their own understanding, this type of learning
is unsupervised learning.
Unsupervised learning is a type of machine learning algorithm used to draw inferences
from datasets consisting of input data without labeled responses.
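Clustering is a common example of learning without labeled responses. The sketch below runs a one-dimensional k-means on made-up ages; the starting centres, the iteration count and the function name are arbitrary choices for the illustration.

```python
def k_means_1d(values, centres, iterations=10):
    """Group values around the nearest centre, then move each centre to its group's mean."""
    centres = list(centres)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centre.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

# Hypothetical ages with no labels attached.
ages = [21, 22, 24, 25, 58, 60, 61, 63]
centres, clusters = k_means_1d(ages, centres=[20, 70])
print(centres, clusters)
```

No one tells the algorithm which group a value belongs to; the two groups emerge from the data alone.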
Reinforcement Learning (Good or Bad?)
If a new situation comes up, the kid will act on his own, drawing on his past
experiences, but as a parent you can tell him at the end of the action whether he did
well or not.
Reinforcement learning is an area of machine learning inspired by behaviorist psychology,
concerned with how software agents ought to take actions in an environment so as to
maximize some notion of cumulative reward.
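The feedback loop can be sketched with a simple agent that tries actions, receives a reward and keeps a running estimate of how good each action is. This is a minimal epsilon-greedy illustration under assumptions: the fixed reward per action, the exploration rate and the function name are all hypothetical.

```python
import random

def run_agent(rewards, episodes=200, epsilon=0.1, seed=0):
    """Learn the value of each action from reward feedback alone."""
    rng = random.Random(seed)
    n_actions = len(rewards)
    estimates = [0.0] * n_actions
    counts = [0] * n_actions
    total_reward = 0.0
    for step in range(episodes):
        if step < n_actions:
            action = step  # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore occasionally
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]  # the "good or bad?" feedback
        counts[action] += 1
        # Running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward

# Hypothetical fixed reward for each of three possible actions.
rewards = [0.0, 1.0, 0.2]
estimates, total_reward = run_agent(rewards)
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, total_reward)
```

The agent is never told which action is correct; it only sees the reward after acting and adjusts its future choices to increase the cumulative reward.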
Data Science Tools
A tool that is widely used by data analysts is R.
R is an open source programming language and software environment for statistical computing and graphics that is supported by
the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing
statistical software and data analysis.
Why R?
Programming and Statistical Language
Apart from being used as a statistical language, R can also be used as a programming language for
analytical purposes.
Data Analysis and Visualization
Apart from being one of the most dominant analytics tools, R is also
one of the most popular tools used for data visualization.
Simple and Easy to Learn
R is simple and easy to learn, read and write.
Free and Open Source
R is an example of FLOSS (Free/Libre and Open Source
Software), which means one can freely distribute copies of this
software, read its source code, modify it, etc.
Datasets
To do any kind of analysis, you need data, right? This need for data is fulfilled through datasets.
What are datasets?
A dataset is a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a
computer.
Figure: sample dataset.
But what if you have a HUGE dataset?
Ever heard of Big Data?
Edureka Hadoop Certification Training (www.edureka.co/big-data-and-hadoop)
What is Big Data?
“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications”
Volume: Processing increasingly huge data sets
Variety: Processing different types of data
Velocity: Data is being generated at an alarming rate
Value: Finding correct meaning out of the data
Veracity: Uncertainty and inconsistencies in the data
Big Data
Now these problems had to be dealt with, right?
Hence, Hadoop came into the picture.
What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
Storage: Distributed File System
Processing: Allows parallel and distributed processing
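The idea of splitting data into blocks, processing the blocks in parallel and merging the partial results can be sketched on a single machine. This is not Hadoop's API: worker threads stand in for cluster nodes, and the block size, the sample records and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(records, block_size):
    """Storage side: cut the data set into fixed-size blocks."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def count_block(block):
    """Processing side: each worker counts the values in its own block."""
    return Counter(block)

def distributed_count(records, block_size=3, workers=2):
    """Count values block by block in parallel, then merge the partial counts."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_block, blocks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Hypothetical weather values, one per accident record.
weather = ["rain", "fine", "fine", "snow", "rain", "fine", "fog"]
totals = distributed_count(weather)
print(totals)
```

Because each block is counted independently, adding more workers (or, in a real cluster, more nodes) lets larger data sets be handled without changing the logic.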
Now you need a data analytics tool that can handle this much processing and data.
For that, we use SparkR.
What is SparkR?
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.1.1, SparkR provides a
distributed data frame implementation that supports operations like selection, filtering and aggregation (similar to R data
frames and dplyr) but on large datasets.
Demo
This dataset provides detailed road safety data about the circumstances of personal injury road accidents from 1979 to 2013. Our
aim is to find the number of accidents that happened:
✓ In various weather conditions
✓ In various light conditions
✓ In various road surface conditions
✓ By make of the accident vehicles
✓ During various days of the week
✓ On various road types
We also want to find:
✓ The number of casualties per accident per year
✓ The number of accidents happening at various speed limits
We have to find the results of these queries in Hadoop.
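Most of these queries are group-and-count operations. The sketch below shows their shape on a handful of made-up records; the field names (`year`, `weather`, `speed_limit`, `casualties`) and values are assumptions and do not reflect the real dataset's columns, and the session itself runs the queries on Hadoop with R.

```python
from collections import Counter, defaultdict

# Hypothetical accident records, invented for illustration.
accidents = [
    {"year": 2012, "weather": "rain", "speed_limit": 30, "casualties": 1},
    {"year": 2012, "weather": "fine", "speed_limit": 60, "casualties": 3},
    {"year": 2013, "weather": "fine", "speed_limit": 30, "casualties": 2},
    {"year": 2013, "weather": "fine", "speed_limit": 30, "casualties": 1},
]

def count_by(records, field):
    """Number of accidents for each distinct value of the given field."""
    return Counter(record[field] for record in records)

def casualties_per_accident_by_year(records):
    """Average number of casualties per accident, for each year."""
    totals = defaultdict(lambda: [0, 0])  # year -> [casualties, accidents]
    for record in records:
        totals[record["year"]][0] += record["casualties"]
        totals[record["year"]][1] += 1
    return {year: casualties / count for year, (casualties, count) in totals.items()}

by_weather = count_by(accidents, "weather")
by_speed_limit = count_by(accidents, "speed_limit")
per_year = casualties_per_accident_by_year(accidents)
print(by_weather, by_speed_limit, per_year)
```

The same `count_by` call answers the weather, light, road surface, day of week and road type queries by changing the field name.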
Demo workflow:
1. Huge amount of accident data
2. Data stored in HDFS
3. Using R for analysis
Analyze the following queries for accidents:
✓ in various weather conditions
✓ in various light conditions
✓ in various road surface conditions
✓ with make information of the accident vehicles
Session In A Minute
Why Data Science?
What is Data Science?
Who is a Data Scientist?
How is a problem solved in Data Science?
Data Science Components
Demo
Thank You …
Questions/Queries/Feedback

More Related Content

What's hot

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
Simplilearn
 

What's hot (20)

What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
 
Data Science Training | Data Science Tutorial | Data Science Certification | ...
Data Science Training | Data Science Tutorial | Data Science Certification | ...Data Science Training | Data Science Tutorial | Data Science Certification | ...
Data Science Training | Data Science Tutorial | Data Science Certification | ...
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to data science.pptx
Introduction to data science.pptxIntroduction to data science.pptx
Introduction to data science.pptx
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data science
Data scienceData science
Data science
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
 
Data Science
Data ScienceData Science
Data Science
 
Data Science Full Course | Edureka
Data Science Full Course | EdurekaData Science Full Course | Edureka
Data Science Full Course | Edureka
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Data Engineering Basics
Data Engineering BasicsData Engineering Basics
Data Engineering Basics
 
Data science presentation 2nd CI day
Data science presentation 2nd CI dayData science presentation 2nd CI day
Data science presentation 2nd CI day
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Ppt on data science
Ppt on data science Ppt on data science
Ppt on data science
 

Viewers also liked

Viewers also liked (20)

Power BI Training | Getting Started with Power BI | Power BI Tutorial | Power...
Power BI Training | Getting Started with Power BI | Power BI Tutorial | Power...Power BI Training | Getting Started with Power BI | Power BI Tutorial | Power...
Power BI Training | Getting Started with Power BI | Power BI Tutorial | Power...
 
What Is DevOps? | Introduction To DevOps | DevOps Tools | DevOps Tutorial | D...
What Is DevOps? | Introduction To DevOps | DevOps Tools | DevOps Tutorial | D...What Is DevOps? | Introduction To DevOps | DevOps Tools | DevOps Tutorial | D...
What Is DevOps? | Introduction To DevOps | DevOps Tools | DevOps Tutorial | D...
 
Bitcoin Blockchain Explained | Understanding Bitcoin and Blockchain | Blockch...
Bitcoin Blockchain Explained | Understanding Bitcoin and Blockchain | Blockch...Bitcoin Blockchain Explained | Understanding Bitcoin and Blockchain | Blockch...
Bitcoin Blockchain Explained | Understanding Bitcoin and Blockchain | Blockch...
 
Android Studio Tutorial For Beginners -2 | Android Development Tutorial | And...
Android Studio Tutorial For Beginners -2 | Android Development Tutorial | And...Android Studio Tutorial For Beginners -2 | Android Development Tutorial | And...
Android Studio Tutorial For Beginners -2 | Android Development Tutorial | And...
 
Angular 4 Tutorial For Beginners | Angular 4 Introduction | Angular 4 Trainin...
Angular 4 Tutorial For Beginners | Angular 4 Introduction | Angular 4 Trainin...Angular 4 Tutorial For Beginners | Angular 4 Introduction | Angular 4 Trainin...
Angular 4 Tutorial For Beginners | Angular 4 Introduction | Angular 4 Trainin...
 
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | EdurekaBig Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
Big Data Use Cases | Hadoop Tutorial for Beginners | Hadoop Training | Edureka
 
Angular 4 Components | Angular 4 Tutorial For Beginners | Learn Angular 4 | E...
Angular 4 Components | Angular 4 Tutorial For Beginners | Learn Angular 4 | E...Angular 4 Components | Angular 4 Tutorial For Beginners | Learn Angular 4 | E...
Angular 4 Components | Angular 4 Tutorial For Beginners | Learn Angular 4 | E...
 
Artificial Neural Network Tutorial | Deep Learning With Neural Networks | Edu...
Artificial Neural Network Tutorial | Deep Learning With Neural Networks | Edu...Artificial Neural Network Tutorial | Deep Learning With Neural Networks | Edu...
Artificial Neural Network Tutorial | Deep Learning With Neural Networks | Edu...
 
Docker Compose | Containerizing MEAN Stack Application | DevOps Tutorial | Ed...
Docker Compose | Containerizing MEAN Stack Application | DevOps Tutorial | Ed...Docker Compose | Containerizing MEAN Stack Application | DevOps Tutorial | Ed...
Docker Compose | Containerizing MEAN Stack Application | DevOps Tutorial | Ed...
 
Django Rest Framework | How to Create a RESTful API Using Django | Django Tut...
Django Rest Framework | How to Create a RESTful API Using Django | Django Tut...Django Rest Framework | How to Create a RESTful API Using Django | Django Tut...
Django Rest Framework | How to Create a RESTful API Using Django | Django Tut...
 
Selenium Page Object Model Using Page Factory | Selenium Tutorial For Beginne...
Selenium Page Object Model Using Page Factory | Selenium Tutorial For Beginne...Selenium Page Object Model Using Page Factory | Selenium Tutorial For Beginne...
Selenium Page Object Model Using Page Factory | Selenium Tutorial For Beginne...
 
Docker Swarm For High Availability | Docker Tutorial | DevOps Tutorial | Edureka
Docker Swarm For High Availability | Docker Tutorial | DevOps Tutorial | EdurekaDocker Swarm For High Availability | Docker Tutorial | DevOps Tutorial | Edureka
Docker Swarm For High Availability | Docker Tutorial | DevOps Tutorial | Edureka
 
Cloud Computing Tutorial For Beginners | What is Cloud Computing | AWS Traini...
Cloud Computing Tutorial For Beginners | What is Cloud Computing | AWS Traini...Cloud Computing Tutorial For Beginners | What is Cloud Computing | AWS Traini...
Cloud Computing Tutorial For Beginners | What is Cloud Computing | AWS Traini...
 
Angular 4 Data Binding | Two Way Data Binding in Angular 4 | Angular 4 Tutori...
Angular 4 Data Binding | Two Way Data Binding in Angular 4 | Angular 4 Tutori...Angular 4 Data Binding | Two Way Data Binding in Angular 4 | Angular 4 Tutori...
Angular 4 Data Binding | Two Way Data Binding in Angular 4 | Angular 4 Tutori...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
 
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
Azure Interview Questions And Answers | Azure Tutorial For Beginners | Azure ...
 
React Components Lifecycle | React Tutorial for Beginners | ReactJS Training ...
React Components Lifecycle | React Tutorial for Beginners | ReactJS Training ...React Components Lifecycle | React Tutorial for Beginners | ReactJS Training ...
React Components Lifecycle | React Tutorial for Beginners | ReactJS Training ...
 
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
 
ReactJS Tutorial For Beginners | ReactJS Redux Training For Beginners | React...
ReactJS Tutorial For Beginners | ReactJS Redux Training For Beginners | React...ReactJS Tutorial For Beginners | ReactJS Redux Training For Beginners | React...
ReactJS Tutorial For Beginners | ReactJS Redux Training For Beginners | React...
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
 

Similar to Data Science Tutorial | Introduction To Data Science | Data Science Training | Edureka

Similar to Data Science Tutorial | Introduction To Data Science | Data Science Training | Edureka (20)

Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
Machine Learning Algorithms | Machine Learning Tutorial | Data Science Tutori...
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
 
Predictive Analytics Using R | Edureka
Predictive Analytics Using R | EdurekaPredictive Analytics Using R | Edureka
Predictive Analytics Using R | Edureka
 
How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems? How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems?
 
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
 
Sentiment Analysis | Machine Learning Algorithms | Data Science Tutorial | Ed...
Sentiment Analysis | Machine Learning Algorithms | Data Science Tutorial | Ed...Sentiment Analysis | Machine Learning Algorithms | Data Science Tutorial | Ed...
Sentiment Analysis | Machine Learning Algorithms | Data Science Tutorial | Ed...
 
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
How to Become a Data Scientist | Data Scientist Skills | Data Science Trainin...
 
De carlo rizk 2010 icelw
De carlo rizk 2010 icelwDe carlo rizk 2010 icelw
De carlo rizk 2010 icelw
 
De carlo rizk 2010 icelw
De carlo rizk 2010 icelwDe carlo rizk 2010 icelw
De carlo rizk 2010 icelw
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Christ PPT Template.pptx
Christ PPT Template.pptxChrist PPT Template.pptx
Christ PPT Template.pptx
 
Data Scientist Roles and Responsibilities | Data Scientist Career | Data Scie...
Data Scientist Roles and Responsibilities | Data Scientist Career | Data Scie...Data Scientist Roles and Responsibilities | Data Scientist Career | Data Scie...
Data Scientist Roles and Responsibilities | Data Scientist Career | Data Scie...
 
Master Data Analyst Course in Bangalore with ProITBridge's Expert Course.pdf
Master Data Analyst Course in Bangalore with ProITBridge's Expert Course.pdfMaster Data Analyst Course in Bangalore with ProITBridge's Expert Course.pdf
Master Data Analyst Course in Bangalore with ProITBridge's Expert Course.pdf
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
Unveiling the Dynamics of Exploratory Data Analysis_ A Deep Dive into Data Sc...
Unveiling the Dynamics of Exploratory Data Analysis_ A Deep Dive into Data Sc...Unveiling the Dynamics of Exploratory Data Analysis_ A Deep Dive into Data Sc...
Unveiling the Dynamics of Exploratory Data Analysis_ A Deep Dive into Data Sc...
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 

Data Science Tutorial | Introduction To Data Science | Data Science Training | Edureka

  • 2. Agenda for Today’s Session
      ➢ Why Data Science?
      ➢ What is Data Science?
      ➢ Who is a Data Scientist?
      ➢ How a Problem is Solved in Data Science?
      ➢ Data Science Components
      ➢ Demo
  • 3. Why Data Science?
  • 4. Why Data Science? The most abundant thing today is data. We have data about everything, and it is multiplying every day. [Figure: increase in data]
  • 5. What is Data Science?
  • 6. What is Data Science? Also called data-driven science, it is an interdisciplinary field of scientific methods, processes and systems for extracting knowledge or insights from data in various forms, either structured or unstructured. A question that data scientists are often asked is: “Tell us something that we don’t know.” It involves: Programming + Statistics + Business
  • 7. Who is a Data Scientist?
  • 8. Who is a Data Scientist? [Diagram: the Data Scientist at the center of three fields]
      ➢ Maths: Statistics, Discrete Maths, Information Theory, Combinatorics, Decision Theory
      ➢ Business: Economics, Finance, Marketing, Operations Management
      ➢ Information Systems: Computer Science, Software Engineering, Systems Development
      ➢ Roles shown: Machine Learning, Data Viz Builders, Statistical Programmers, Econometricians, Management Scientists, Actuaries, BI Developers, Data Analysis
  • 9. Role Of A Data Scientist: The Data Scientist will be responsible for designing and creating processes and layouts for complex, large-scale data sets used for modeling, data mining, and research purposes. Responsibilities:
      ➢ Selecting features, building and optimizing classifiers using machine learning techniques.
      ➢ Data mining using state-of-the-art methods.
      ➢ Extending the company’s data with third-party sources of information when needed.
      ➢ Processing, cleansing, and verifying the integrity of data for analysis.
      ➢ Building predictive models using Machine Learning algorithms.
  • 10. How is a problem solved in Data Science?
  • 11. Problem Solving in Data Science
  • 12. Problem Solving in Data Science. Stages: Discovery, Data Preparation, Model Planning, Model Building, Operationalize, Communicate Results.
      ➢ Discovery involves acquiring data from all the identified internal and external sources that can help answer the business question.
      ➢ This data could be:
          • logs from web servers
          • social media data
          • census datasets
          • data streamed from online sources via APIs
  • 13. Problem Solving in Data Science: The doctor gets this data from the medical history of the patient. Attributes:
      ➢ npreg: Number of times pregnant
      ➢ glucose: Plasma glucose concentration
      ➢ bp: Blood pressure
      ➢ skin: Triceps skinfold thickness
      ➢ bmi: Body mass index
      ➢ ped: Diabetes pedigree function
      ➢ age: Age
      ➢ income: Income
      Income is an irrelevant attribute in the prediction of diabetes.
  • 14. Problem Solving in Data Science
      ➢ The data can have a lot of inconsistencies, like missing values, blank columns, anomalous values and incorrect data formats, which need to be cleaned.
      ➢ The data must be explored, preprocessed and conditioned prior to modeling.
      ➢ This will help you to spot the outliers and establish relationships between the variables.
  • 15. Problem Solving in Data Science: This data has a lot of anomalies and needs cleansing before further analysis can be done.
  • 16. Problem Solving in Data Science: We clean and preprocess this data by removing the outliers, filling in the null values and normalizing the data types.
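The three cleaning steps named on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the method used in the session: the sample `glucose` values, the `low`/`high` bounds and the choice to fill gaps with the median are all assumptions made for the example, and out-of-range values are treated as missing rather than dropping the whole row.

```python
from statistics import median


def clean_column(raw_values, low, high):
    """Normalize types, blank out outliers, then fill the gaps with the median."""
    parsed = []
    for value in raw_values:
        try:
            number = float(value)  # normalize the data type
        except (TypeError, ValueError):
            number = None  # missing or unparseable entry
        if number is not None and not (low <= number <= high):
            number = None  # treat out-of-range values as outliers
        parsed.append(number)

    known = [n for n in parsed if n is not None]
    fill = median(known)  # hypothetical fill strategy
    return [fill if n is None else n for n in parsed]


# Invented sample column with a null, a blank, an outlier and mixed types.
glucose = ["148", "85", None, "", "9999", 183, "89"]
cleaned = clean_column(glucose, low=0, high=500)
print(cleaned)
```

Whether an outlier should be replaced, clipped or dropped depends on the dataset; the median is used here only because it is not pulled around by the extreme values being removed.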
  • 17. Problem Solving in Data Science
      ➢ Here, we determine the methods and techniques to draw the relationships between variables.
      ➢ Apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.
  • 18. Problem Solving in Data Science: Use visualization techniques like histograms, line graphs and box plots to get a fair idea of the distribution of the data.
  • 19. Problem Solving in Data Science
      ➢ Develop datasets for training and testing purposes.
      ➢ Consider whether existing tools will suffice for running the models.
      ➢ Analyze various learning techniques like classification, association and clustering to build the model.
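Developing separate training and testing datasets usually means shuffling the rows and holding some of them back. A minimal sketch, assuming a 25% test share and a fixed seed (both values are arbitrary choices for the example, and `train_test_split` here is a hand-written helper, not a library call):

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


rows = list(range(20))  # stand-in for 20 patient records
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is then built on `train` only, and `test` is used to check how well it does on records it has not seen.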
  • 20. Problem Solving in Data Science: This is a decision tree based on different attributes. [Figure: decision tree]
  • 21. Problem Solving in Data Science
      ➢ Deliver final reports, briefings, code and technical documents.
      ➢ Implement a pilot project in a real-time production environment.
      ➢ Look for performance constraints, if any.
  • 22. Problem Solving in Data Science
      ➢ Identify all the key findings and communicate them to the stakeholders.
      ➢ Explain the model and the result to the medical authorities.
      ➢ Determine whether the results of the project are a success or a failure based on the criteria developed.
  • 23. Problem Solving in Data Science
      ➢ Diabetes positive set:
          • glucose > 154
          • 127 < glucose <= 154 and bmi > 30.9
          • glucose <= 127 and pregnant > 5
          • glucose <= 127 and pregnant <= 5 and age > 28
          • glucose <= 127 and pregnant <= 5 and age <= 28 and bmi > 30.9
      ➢ Diabetes negative set:
          • 127 < glucose <= 154 and bmi <= 30.9
          • glucose <= 127 and pregnant <= 5 and age <= 28 and bmi <= 30.9
      ➢ We can use this decision tree result to judge whether the patient is vulnerable to diabetes or not.
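The rule sets on this slide can be read as a chain of nested checks. The sketch below encodes the positive set exactly as listed and, as an assumption, treats every case that matches no positive rule as negative. The function name and argument order are made up for the example, and the thresholds come from the slide, not from any medical guideline.

```python
def diabetes_positive(glucose, bmi, pregnant, age):
    """Return True when the inputs fall in the slide's diabetes positive set."""
    if glucose > 154:
        return True
    if glucose > 127:  # 127 < glucose <= 154
        return bmi > 30.9
    # From here on, glucose <= 127.
    if pregnant > 5:
        return True
    if age > 28:
        return True
    return bmi > 30.9  # pregnant <= 5 and age <= 28


print(diabetes_positive(glucose=140, bmi=32.0, pregnant=1, age=40))
print(diabetes_positive(glucose=100, bmi=25.0, pregnant=2, age=25))
```

Each `if` corresponds to one split in the tree, which is why a fitted decision tree is easy to explain to non-technical stakeholders.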
  • 24. How to choose Algorithms in Data Science?
  • 25. Problem Solving in Data Science: We take a top-down approach to choosing an algorithm. These are the 5 questions which can be answered in data science:
      ➢ Q1. Is this A or B? (Classification algorithms)
      ➢ Q2. Is this weird? (Anomaly detection algorithms)
      ➢ Q3. How much or how many? (Regression algorithms)
      ➢ Q4. How is this organized? (Clustering algorithms)
      ➢ Q5. What should I do next? (Reinforcement learning)
      These algorithms fall into the following three categories:
  • 26. Categories of Algorithms. Types of Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning
  • 27. Supervised Learning (teaching by example): Let’s take an example here. Say you are a teacher, and your way of teaching is to teach by example, i.e. for every problem your students face, you provide the solution. This type of learning is called supervised learning. Supervised learning is a type of machine learning algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can make predictions of the response values for a new dataset. Let’s take the same example forward.
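To make "input data and response values" concrete, here is a small sketch using a 1-nearest-neighbour rule, chosen only because it is short; the slides do not name a specific algorithm. The training pairs (glucose, bmi) and their labels are invented for the example.

```python
def nearest_neighbor_predict(training_data, new_input):
    """Predict the response of the training example closest to new_input."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    closest_input, closest_response = min(
        training_data,
        key=lambda pair: squared_distance(pair[0], new_input),
    )
    return closest_response


# Training dataset: (input data, response value) pairs, made up for illustration.
training_data = [
    ((150.0, 33.0), "positive"),
    ((160.0, 35.0), "positive"),
    ((90.0, 22.0), "negative"),
    ((100.0, 24.0), "negative"),
]

prediction = nearest_neighbor_predict(training_data, (155.0, 31.0))
print(prediction)
```

The "teacher" is the response column: every training input comes with its answer, and the model reuses those answers for new inputs.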
  • 28. Unsupervised Learning (self learning): When your kids make decisions based on their own understanding, this type of learning is unsupervised learning. Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
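As a contrast with the supervised case, the sketch below groups unlabeled numbers into two clusters with a bare-bones one-dimensional k-means (k = 2). The algorithm choice, the starting centers (smallest and largest value) and the sample values are assumptions made for this example.

```python
def two_means(values, iterations=10):
    """Split unlabeled numbers into two groups around two moving centers."""
    centers = [min(values), max(values)]  # simple, deterministic starting points
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            sum(g) / len(g) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]  # no labels attached
centers, groups = two_means(values)
print(centers, groups)
```

No response values are supplied anywhere; the structure (two groups) is inferred from the inputs alone.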
  • 29. Reinforcement Learning (good or bad?): If a new situation comes up, the kid will take actions on his own, i.e. from his past experiences, but as a parent you can tell him at the end of an action whether he did well or not. Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
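The parent-and-kid picture can be sketched as a very reduced, bandit-style loop: the agent tries actions, receives a reward after each one, and nudges its estimate of that action toward the reward. The action names, the +1/-1 rewards, the step size and the round-robin way of trying actions are all assumptions made for the illustration; real reinforcement learning also deals with states and delayed rewards, which this sketch leaves out.

```python
def learn_from_feedback(actions, feedback, rounds=20, step=0.5):
    """Estimate how good each action is from the reward received after trying it."""
    estimates = {action: 0.0 for action in actions}
    for i in range(rounds):
        action = actions[i % len(actions)]  # try the actions in turn (simplification)
        reward = feedback(action)           # the "good or bad?" signal
        estimates[action] += step * (reward - estimates[action])
    best = max(estimates, key=estimates.get)
    return best, estimates


def parent_feedback(action):
    """Hypothetical reward: +1 for the behaviour the parent approves of, -1 otherwise."""
    return 1.0 if action == "share toys" else -1.0


best_action, estimates = learn_from_feedback(["grab toys", "share toys"], parent_feedback)
print(best_action, estimates)
```

Unlike supervised learning, nobody tells the agent the right answer in advance; it only hears afterwards how well its own choice turned out.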
  • 30. Data Science Tools
  • 31. Data Science Tools: The tool that is widely used by data analysts is R. R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
  • 32. Why R?
      ➢ Programming and Statistical Language: Apart from being used as a statistical language, it can also be used as a programming language for analytical purposes.
      ➢ Data Analysis and Visualization: Apart from being one of the most dominant analytics tools, R is also one of the most popular tools used for data visualization.
  • 33. Why R?
      ➢ Simple and Easy to Learn: R is simple and easy to learn, read and write.
      ➢ Free and Open Source: R is an example of FLOSS (Free/Libre and Open Source Software), which means one can freely distribute copies of this software, read its source code, modify it, etc.
  • 34. Datasets: Now, to do any kind of analysis, you need data, right? This need for data is fulfilled through datasets. What are datasets? A dataset is a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer. [Figure: sample dataset]
  • 35. Datasets: But what if you have a HUGE dataset? Ever heard of Big Data?
  • 36. What is Big Data?
  • 37. www.edureka.co/big-data-and-hadoop EDUREKA HADOOP CERTIFICATION TRAINING What is Big Data? “Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
      ➢ Volume: processing increasingly huge data sets
      ➢ Variety: processing different types of data
      ➢ Velocity: data is being generated at an alarming rate
      ➢ Value: finding correct meaning out of the data
      ➢ Veracity: uncertainty and inconsistencies in the data
  • 38. Big Data: Now these problems had to be dealt with, right? Hence, Hadoop came into the picture.
  • 39. What is Hadoop?
  • 40. What is Hadoop? Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
      ➢ Storage: distributed file system
      ➢ Processing: allows parallel and distributed processing
  • 43. What is Hadoop? Now you need a data analytics tool that can handle this much processing and data. For that we use SparkR.
  • 44. What is SparkR? SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.1.1, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames and dplyr) but on large datasets.
  • 46. Demo: This dataset provides detailed road safety data about the circumstances of personal injury road accidents from 1979 to 2013. Our aim is to find the following:
      ✓ Number of accidents in various weather conditions
      ✓ Number of accidents in various light conditions
      ✓ Number of accidents in various road surface conditions
      ✓ Number of accidents by make of the vehicles involved
      ✓ Number of accidents on various days of the week
      ✓ Number of accidents on various road types
      ✓ Number of casualties per accident per year
      ✓ Number of accidents at various speed limits
      We have to find the results of these queries in Hadoop.
  • 47. Demo: (1) Huge amount of accident data, (2) data stored in HDFS, (3) using R for analysis. Analyze the following queries for accidents:
      ➢ in various weather conditions
      ➢ in various light conditions
      ➢ in various road surface conditions
      ➢ with make information of the accident vehicles
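Most of the demo queries are "count the accidents per value of some column". The session runs them on data stored in HDFS and analyzed from R; the sketch below is only a small in-memory Python stand-in for the same kind of aggregation. The records and the field names (`weather`, `light`, `speed_limit`) are invented, since the real dataset's column names are not given in the slides.

```python
from collections import Counter

# Invented sample records standing in for rows of the accident dataset.
accidents = [
    {"weather": "Fine", "light": "Daylight", "speed_limit": 30},
    {"weather": "Raining", "light": "Darkness", "speed_limit": 30},
    {"weather": "Fine", "light": "Daylight", "speed_limit": 60},
    {"weather": "Snowing", "light": "Darkness", "speed_limit": 70},
    {"weather": "Raining", "light": "Daylight", "speed_limit": 30},
    {"weather": "Fine", "light": "Darkness", "speed_limit": 60},
]


def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(record[field] for record in records)


by_weather = count_by(accidents, "weather")
by_speed_limit = count_by(accidents, "speed_limit")
print(by_weather)
print(by_speed_limit)
```

On a distributed system the same idea is split in two: each node counts its own share of the rows, and the partial counts are then added together.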
  • 48. Session In A Minute
      ➢ Why Data Science?
      ➢ What is Data Science?
      ➢ Who is a Data Scientist?
      ➢ How is a problem solved in Data Science?
      ➢ Data Science Components
      ➢ Demo
  • 49. Thank You … Questions/Queries/Feedback