2. Slide 2 www.edureka.co/big-data-and-hadoop
AGENDA
Today we will take you through the following:
What is Big Data & Hadoop?
What is a Data Product?
What is Data Science?
Why Hadoop for Data Science?
Is Hadoop a necessity for Data Science?
3. Slide 3
What is Big Data & Hadoop?
4. Slide 4
BIG DATA
Big Data is a popular term used to describe the exponential growth of data.
Big Data can be structured, unstructured, or a combination of both.
5. Slide 5
BIG DATA
The 3 V’s (Volume, Variety, and Velocity) are the three defining properties, or dimensions, of Big Data.
6. Slide 6
HADOOP
Hadoop is an open-source software framework that supports the storage and processing of large data sets in a distributed computing environment.
Hadoop was one of the first widely adopted tools for handling Big Data and remains widely used.
7. Slide 7
A BRIEF HISTORY OF HADOOP
8. Slide 8
HADOOP: HDFS & MAPREDUCE
Built for efficient large-scale storage and processing
HDFS: Distributed file system; a self-healing data store
MAPREDUCE: Distributed computation framework that handles the complexities of distributed programming
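To make the MapReduce model concrete, here is a minimal single-process sketch of a word count in plain Python. It only imitates the map, shuffle, and reduce steps in memory; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions rather than Hadoop identifiers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, used only for illustration.
counts = word_count(["big data big cluster", "data node"])
print(counts)
```

In a real cluster the framework would run many map and reduce tasks on different nodes and handle the shuffle, scheduling, and failures itself; the programmer supplies only the map and reduce logic.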
9. Slide 9
KEY TO HADOOP’S POWER
Computation is co-located with the data
Data and computation systems are co-designed and co-developed to work together
Data is processed in parallel across thousands of “commodity” hardware nodes
Self-healing; failures are handled by software
Designed for write-once, read-many access
No random writes
Optimized for minimal seeks on hard drives
10. Slide 10
What is a Data Product?
“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
11. Slide 11
Example #1: People you may know
12. Slide 12
Example #2: Spell Correction
14. Slide 14
DATA SCIENCE
#1: Extracting deep meaning from data (data mining; finding “gems” in data)
15. Slide 15
Common Data Science tasks
16. Slide 16
DATA SCIENCE
#2: Building Data Products (delivering “gems” on a regular basis)
17. Slide 17
Why HADOOP for DATA SCIENCE?
Reason #1:
Explore full datasets
18. Slide 18
#1: Exploration of full datasets
19. Slide 19
Why HADOOP for DATA SCIENCE?
Reason #2:
Mining of larger datasets
20. Slide 20
#2: Mining of larger datasets
More Data → Better Outcomes
21. Slide 21
Why HADOOP for DATA SCIENCE?
Reason #3:
Large-scale data preparation
22. Slide 22
#3: Large-scale data preparation
Data preparation is often estimated to make up about 80% of data science work
23. Slide 23
Why HADOOP for DATA SCIENCE?
Reason #4:
Accelerate data-driven innovation
24. Slide 24
Speed barriers of traditional data architectures
25. Slide 25
“Schema on read” means faster time-to-innovation
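As a rough illustration of “schema on read”, the sketch below keeps raw records as untyped text and applies a schema only when the data is read. It is a plain-Python analogy, not Hadoop or Hive code; the sample data, the `read_with_schema` helper, and the field names are hypothetical.

```python
import csv
import io

# Raw data is stored as-is; no schema is enforced at write time.
RAW = "2015-01-03,alice,42\n2015-01-04,bob,not_a_number\n"


def read_with_schema(raw_text, schema):
    """Apply (name, converter) pairs to each raw record at read time."""
    for row in csv.reader(io.StringIO(raw_text)):
        record = {}
        for (name, convert), value in zip(schema, row):
            try:
                record[name] = convert(value)
            except ValueError:
                record[name] = None  # value does not fit this schema
        yield record


# Two different schemas applied to the same stored bytes.
as_clicks = list(read_with_schema(RAW, [("day", str), ("user", str), ("clicks", int)]))
as_users = list(read_with_schema(RAW, [("day", str), ("user", str)]))
print(as_clicks)
print(as_users)
```

Because the stored data is never rewritten, a new question only needs a new schema at read time, which is the source of the faster time-to-innovation the slide refers to.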
28. Slide 28
SURVEY
Your feedback is vital to us, be it a compliment, a suggestion, or a complaint. It helps us make your experience better!
Please spare a few minutes to take the survey after the webinar.