10 concepts the enterprise
decision maker needs to
understand about Hadoop
Donald Miner
Strata + Hadoop World 2016 – San Jose
March 31st, 2016
dminer@minerkasch.com
@donaldpminer
Purpose of this talk
An honest and minimal introduction to Hadoop
Why is Hadoop popular?
What does Hadoop do well and why?
What is bad about Hadoop?
#1 - Hadoop masks being a distributed system
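import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// The enclosing class and its two fields are assumed here; the slide shows only map()
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  // This block of code defines the behavior of the map phase
  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line of text into words
    StringTokenizer itr = new StringTokenizer(value.toString());
    // Go through each word and send it
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      // "I've seen this word once!"
      context.write(word, one);
    }
  }
}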
[1]$ hadoop fs -put hamlet.txt datz/hamlet.txt
[2]$ hadoop fs -put macbeth.txt data/macbeth.txt
[3]$ hadoop fs -mv datz/hamlet.txt data/hamlet.txt
[4]$ hadoop fs -ls data/
-rw-r--r-- 1 don don 139k 2012-01-31 23:49 /user/don/data/caesar.txt
-rw-r--r-- 1 don don 180k 2013-09-25 20:45 /user/don/data/hamlet.txt
-rw-r--r-- 1 don don 117k 2013-09-25 20:46 /user/don/data/macbeth.txt
Why is this so important?
What does it not do for me?
#2 - Hadoop scales out linearly
The amount of data, the amount of time something takes,
and the amount of hardware you have are linearly linked (usually)
Double the compute,
Half the time!
Double the data,
twice the time!
Double the data,
Double the compute,
The same time!
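The three rules of thumb above can be written as simple arithmetic. This is a minimal sketch, assuming an idealized model in which run time equals data volume divided by total cluster throughput. The class name, method name, and the per-node throughput figure are hypothetical, and real clusters only approximate this behavior.

```java
// Idealized model (an assumption, not a Hadoop API):
// time = data / (nodes * per-node throughput).
final class LinearScalingSketch {

    // Estimated hours to process dataTb terabytes on `nodes` machines,
    // each assumed to process tbPerNodeHour terabytes per hour.
    static double estimatedHours(double dataTb, int nodes, double tbPerNodeHour) {
        if (nodes <= 0 || tbPerNodeHour <= 0) {
            throw new IllegalArgumentException("nodes and throughput must be positive");
        }
        return dataTb / (nodes * tbPerNodeHour);
    }

    public static void main(String[] args) {
        // Hypothetical baseline: 100 TB on 10 nodes at 1 TB per node per hour.
        System.out.println("baseline:           " + estimatedHours(100, 10, 1.0) + " h");
        // Double the compute: half the time.
        System.out.println("double the compute: " + estimatedHours(100, 20, 1.0) + " h");
        // Double the data: twice the time.
        System.out.println("double the data:    " + estimatedHours(200, 10, 1.0) + " h");
        // Double both: the same time as the baseline.
        System.out.println("double both:        " + estimatedHours(200, 20, 1.0) + " h");
    }
}
```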
Data locality!
Why is this so important?
What does it not do for me?
#3 - Hadoop runs on commodity hardware
• Non-proprietary
• Easy to acquire (all it takes is $)
• Value (not necessarily cheap)
• Let software handle the hard problems
Why is this so important?
What does it not do for me?
#4 - Hadoop handles unstructured data
Query languages like SQL assume some sort of structure
Relational databases and other databases require structure
MapReduce/Spark is just Java/Scala/Python/etc
You can do anything Java can do
HDFS just stores files
You can store anything in a file
Why is this so important?
What does it not do for me?
#5 - In Hadoop, you load data first and ask questions later
BEFORE:
ETL
Years of planning
Schemas & ER Diagrams
WITH HADOOP:
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
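The idea of interpreting data as it is read, rather than when it is stored, can be sketched in plain Java. This is an illustration under stated assumptions, not Hadoop code: the class, the method names, and the sample records are all hypothetical, and an in-memory list stands in for files in HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SchemaOnReadSketch {

    // Raw records exactly as loaded: no schema was declared at write time.
    // (Hypothetical sample data standing in for files stored in HDFS.)
    static final List<String> RAW = Arrays.asList(
            "2016-03-31,san-jose,42",
            "2016-04-01,baltimore,17",
            "garbage line");

    // Question 1, asked later: the sum of the third field.
    // Structure is imposed here, at read time.
    static int totalCount(List<String> rawLines) {
        int total = 0;
        for (String line : rawLines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue; // skip lines that do not fit this interpretation
            }
            try {
                total += Integer.parseInt(fields[2].trim());
            } catch (NumberFormatException e) {
                // not a number under this interpretation; skip the line
            }
        }
        return total;
    }

    // Question 2: a different interpretation of the same stored lines.
    static List<String> cities(List<String> rawLines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : rawLines) {
            String[] fields = line.split(",");
            if (fields.length == 3) {
                seen.add(fields[1].trim());
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println("total = " + totalCount(RAW));   // prints "total = 59"
        System.out.println("cities = " + cities(RAW));      // prints "cities = [san-jose, baltimore]"
    }
}
```

Neither question had to be known when the lines were stored; each one brings its own parsing rules.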
Why is this so important?
What does it not do for me?
#6 - HDFS stores the data but has some major limitations
• Stores files in folders
• Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block
• Blocks are scattered all over the place
• Can scale to thousands of nodes and hundreds of petabytes
[Figure: a FILE divided into BLOCKS]
Limitations:
• Low IOPS
• Higher latency
• Can’t edit files
• Can’t handle small files
• Low storage efficiency (33%)
• Low throughput on single files
• But…
• High aggregate throughput
• Massive scale
• Software only
• Few bottlenecks
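The block, replica, and storage-efficiency figures above follow from simple arithmetic. This is a minimal sketch, not an HDFS API: the class and method names are hypothetical, and the 128 MiB block size is an assumed value chosen from within the ~64MB-2GB range quoted earlier.

```java
final class HdfsStorageSketch {

    // Number of blocks needed to hold a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes consumed across the cluster when every block is stored `replicas` times.
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    // Fraction of raw disk that holds unique data.
    static double storageEfficiency(int replicas) {
        return 1.0 / replicas;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024; // assumed block size

        System.out.println("blocks: " + blockCount(oneGiB, blockSize));                        // prints "blocks: 8"
        System.out.println("raw bytes with 3 replicas: " + rawBytes(oneGiB, 3));               // prints 3221225472
        System.out.println("efficiency: " + Math.round(storageEfficiency(3) * 100) + "%");     // prints "efficiency: 33%"
    }
}
```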
Why is this so important?
What does it not do for me?
#7 - YARN controls everything going on and is
mostly behind the scenes
• Controls the compute resources on the cluster
• Was the key new feature in Hadoop 2.0
• Abstracted resource management from MapReduce to be more
general
• MapReduce became just another application
• YARN is key in enabling multiple compute engines at once
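The idea of one resource manager serving several compute engines at once can be pictured with a toy model. This is emphatically not the YARN API: the class, its methods, the application names, and the memory figures are all hypothetical. It only shows a shared pool being handed out to whichever application asks, with MapReduce treated like any other.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only; it does not use or imitate real YARN classes.
final class ResourceManagerSketch {
    private int freeMemoryMb;
    private final Map<String, Integer> grantedMb = new LinkedHashMap<>();

    ResourceManagerSketch(int totalMemoryMb) {
        this.freeMemoryMb = totalMemoryMb;
    }

    // Grant a container if enough memory remains in the shared pool; otherwise refuse.
    boolean requestContainer(String application, int memoryMb) {
        if (memoryMb <= 0 || memoryMb > freeMemoryMb) {
            return false;
        }
        freeMemoryMb -= memoryMb;
        grantedMb.merge(application, memoryMb, Integer::sum);
        return true;
    }

    int freeMemoryMb() {
        return freeMemoryMb;
    }

    Map<String, Integer> grantedMb() {
        return grantedMb;
    }

    public static void main(String[] args) {
        ResourceManagerSketch rm = new ResourceManagerSketch(8192);
        System.out.println(rm.requestContainer("mapreduce-job", 4096)); // true
        System.out.println(rm.requestContainer("spark-job", 3072));     // true
        System.out.println(rm.requestContainer("storm-topology", 2048)); // false: only 1024 MB left
        System.out.println(rm.grantedMb() + ", free = " + rm.freeMemoryMb());
    }
}
```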
Why is this so important?
What does it not do for me?
#8 - MapReduce may be getting a bad rap, but it’s
still really important (but other engines are
important, too)
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Mappers (you code this)
• Loads data from HDFS
• Filter, transform, parse
• Outputs (key, value) pairs
Reducers (you code this, too)
• Automatically groups by the mapper’s output key
• Aggregate, count, statistics
• Outputs to HDFS
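The mapper, group-by-key, and reducer steps described here can be imitated on a single machine in plain Java. This is a sketch under stated assumptions, not Hadoop code: the class and method names are hypothetical, an in-memory list stands in for HDFS, and the grouping step that the framework normally performs is written out by hand.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class MapReduceSketch {

    // Map phase: parse one line and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Group step: collect every emitted value under its key.
    // In Hadoop the framework does this between the map and reduce phases.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate all values seen for one key; here, a sum.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Wire the three steps together for a word count.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : groupByKey(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to thine own self be true");
        // prints {be=3, not=1, or=1, own=1, self=1, thine=1, to=3, true=1}
        System.out.println(wordCount(lines));
    }
}
```

Only `map` and `reduce` contain logic specific to the problem; the grouping in between is generic, which is why the framework can take it over.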
“MapReduce is slow”
“MapReduce is hard to use”
[Figure, built up over several slides: a workload spectrum labeled Real-time, Ad-hoc, and Large-scale analytics. MapReduce appears first; Storm/streaming, Impala/HAWQ/Stinger, and Spark are then added; the last build shows only MapReduce and Spark.]
Not everyone has this problem, but it’s a really interesting problem!
Why is this so important?
What does it not do for me?
#9 - Hadoop is open source
Free – money isn’t just a financial barrier, but a bureaucratic one, too
Help yourself – Hadoop is a complex system underneath and sometimes
you need to figure something out for yourself
Adoption – it’s easier to adopt, so adoption is more widespread
Expansion – can be extended by anyone
Why is this so important?
What does it not do for me?
#10 - The Hadoop ecosystem is constantly growing and evolving
Not only do individual Hadoop
components improve…
But Hadoop overall improves with new
components that do new things
differently.
And they piece together into something
that gets a lot of work done.
Why is this so important?
What does it not do for me?
Play by Hadoop’s rules and it’ll give you what you want

 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

10 concepts the enterprise decision maker needs to understand about Hadoop

  • 1. 10 concepts the enterprise decision maker needs to understand about Hadoop Donald Miner Strata + Hadoop World 2016 – San Jose March 31st, 2016
  • 3. Purpose of this talk An honest and minimal introduction to Hadoop Why is Hadoop popular? What does Hadoop do well and why? What is bad about Hadoop?
  • 4. #1 - Hadoop masks being a distributed system
  • 5. #1 - Hadoop masks being a distributed system

        // This block of code defines the behavior of the map phase
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          // Split the line of text into words
          StringTokenizer itr = new StringTokenizer(value.toString());
          // Go through each word and send it
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // "I've seen this word once!"
            context.write(word, one);
          }
        }

        [1]$ hadoop fs -put hamlet.txt datz/hamlet.txt
        [2]$ hadoop fs -put macbeth.txt data/macbeth.txt
        [3]$ hadoop fs -mv datz/hamlet.txt data/hamlet.txt
        [4]$ hadoop fs -ls data/
        -rw-r--r-- 1 don don 139k 2012-01-31 23:49 /user/don/data/caesar.txt
        -rw-r--r-- 1 don don 180k 2013-09-25 20:45 /user/don/data/hamlet.txt
        -rw-r--r-- 1 don don 117k 2013-09-25 20:46 /user/don/data/macbeth.txt
  • 6. #1 - Hadoop masks being a distributed system Why is this so important? What does it not do for me?
  • 7. #2 - Hadoop scales out linearly The amount of data, the amount of time something takes, and the amount of hardware you have are linearly linked (usually)
  • 8. #2 - Hadoop scales out linearly Double the compute, Half the time!
  • 9. #2 - Hadoop scales out linearly Double the data, twice the time!
  • 10. #2 - Hadoop scales out linearly Double the data, double the compute: the same time!
  • 11. #2 - Hadoop scales out linearly Data locality!
  • 12. #2 - Hadoop scales out linearly Why is this so important? What does it not do for me?
  • 13. #3 - Hadoop runs on commodity hardware
  • 14. #3 - Hadoop runs on commodity hardware • Non-proprietary • Easy to acquire (all it takes is $) • Value (not necessarily cheap) • Let software handle the hard problems
  • 15. #3 - Hadoop runs on commodity hardware Why is this so important? What does it not do for me?
  • 16. #4 - Hadoop handles unstructured data Query languages like SQL assume some sort of structure. Relational databases and other databases require structure. MapReduce/Spark is just Java/Scala/Python/etc.: you can do anything Java can do. HDFS just stores files: you can store anything in a file.
  • 17. #4 - Hadoop handles unstructured data Why is this so important? What does it not do for me?
  • 18. #5 - In Hadoop, you load data first and ask questions later BEFORE: ETL; years of planning; schemas & ER diagrams
  • 19. #5 - In Hadoop, you load data first and ask questions later WITH HADOOP: load data first, ask questions later. Data is parsed/interpreted as it is loaded out of HDFS.
  • 20. #5 - In Hadoop, you load data first and ask questions later
  • 21. Why is this so important? What does it not do for me? #5 - In Hadoop, you load data first and ask questions later
  • 22. #6 - HDFS stores the data but has some major limitations • Stores files in folders • Nobody cares what’s in your files • Chunks large files into blocks (~64MB-2GB) • 3 replicas of each block • Blocks are scattered all over the place • Can scale to thousands of nodes and hundreds of petabytes [Diagram labels: FILE, BLOCKS]
  • 23. #6 - HDFS stores the data but has some major limitations Limitations: • Low IOPS • Higher latency • Can’t edit files • Can’t handle small files • Low storage efficiency (33%) • Low throughput on single files But: • High aggregate throughput • Massive scale • Software only • Few bottlenecks
  • 24. Why is this so important? What does it not do for me? #6 - HDFS stores the data but has some major limitations
  • 25. #7 - YARN controls everything going on and is mostly behind the scenes • Controls the compute resources on the cluster • Was the key new feature in Hadoop 2.0 • Abstracted resource management from MapReduce to be more general • MapReduce became just any other application • YARN is key in enabling multiple compute engines at once
  • 26. Why is this so important? What does it not do for me? #7 - YARN controls everything going on and is mostly behind the scenes
  • 27. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) • Analyzes raw data in HDFS where the data is • Jobs are split into Mappers and Reducers. Mappers (you code this): load data from HDFS; filter, transform, parse; output (key, value) pairs. Reducers (you code this, too): input is automatically grouped by the mapper’s output key; aggregate, count, statistics; output to HDFS.
  • 28. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) “MapReduce is slow” “MapReduce is hard to use”
  • 29. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) Real-time / Large-scale analytics / Ad-hoc: MapReduce!
  • 30. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) Real-time / Large-scale analytics / Ad-hoc: MapReduce!, Storm/streaming
  • 31. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) Real-time / Large-scale analytics / Ad-hoc: MapReduce!, Storm/streaming, Impala/HAWQ/Stinger
  • 32. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) Real-time / Large-scale analytics / Ad-hoc: MapReduce!, Storm/streaming, Impala/HAWQ/Stinger, Spark
  • 33. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) Real-time / Large-scale analytics / Ad-hoc: MapReduce!, Storm/streaming, Spark
  • 34. #8 - MapReduce may be getting a bad rap, but it’s still really important (but other engines are important, too) Real-time / Large-scale analytics / Ad-hoc: MapReduce!, Spark. Not everyone has this problem, but it’s a really interesting problem!
  • 35. Why is this so important? What does it not do for me? #8 - MapReduce may be getting a bad rap, but it’s still really important
  • 36. #9 - Hadoop is open source • Free: money isn’t just a financial barrier, but a bureaucratic one, too • Help yourself: Hadoop is a complex system underneath, and sometimes you need to figure something out for yourself • Adoption: it’s easier to adopt, so adoption is more widespread • Expansion: it can be extended by anyone
  • 37. Why is this so important? What does it not do for me? #9 - Hadoop is open source
  • 38. #10 - The Hadoop ecosystem is constantly growing and evolving Not only do individual Hadoop components improve… But Hadoop overall improves with new components that do new things differently. And they piece together into something that gets a lot of work done.
  • 39. Why is this so important? What does it not do for me? #10 - The Hadoop ecosystem is constantly growing and evolving
  • 40. Play by Hadoop’s rules and it’ll give you what you want
  • 41. 10 concepts the enterprise decision maker needs to understand about Hadoop Donald Miner Strata + Hadoop World 2016 – San Jose March 31st, 2016 dminer@minerkasch.com @donaldpminer
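Slide 5 shows only the map function, and slide 27 describes the split between mappers, the automatic group-by-key, and reducers. As a rough illustration of how those three steps fit together, here is a minimal single-process sketch in plain Java. It uses no Hadoop classes at all; the class name `WordCountSketch` and its methods are hypothetical names chosen for this example, and a real job would let the framework do the grouping and distribute the work across the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, single-process illustration of map -> group by key -> reduce.
// Nothing here is Hadoop API; it only mirrors the shape of the word count job.
class WordCountSketch {

    // "Map" step: split one line into words and emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Group by key" step: collect all emitted values under their key.
    // In Hadoop the framework does this between the map and reduce phases.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: aggregate the values for a single key (here, a sum).
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs the three steps in order over a list of input lines.
    static Map<String, Integer> countWords(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : groupByKey(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("to be or not to be", "to sleep")));
    }
}
```

The point of the sketch is that the author of a job writes only `map` and `reduce`; the grouping in the middle, and the fact that many machines are involved, stays out of the code.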

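Slides 22 and 23 say that HDFS chunks files into blocks, keeps 3 replicas of each block, and ends up with about 33% storage efficiency. The arithmetic behind those statements can be sketched as follows. This is only back-of-the-envelope math, not HDFS code: the class `HdfsStorageSketch` is a hypothetical name, and the 128 MiB block size is an assumed value picked from inside the "~64MB-2GB" range on the slide.

```java
// Hypothetical arithmetic sketch; it does not talk to HDFS.
class HdfsStorageSketch {

    // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw disk space consumed once every byte is stored `replicas` times.
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    // Fraction of raw disk that holds unique data (1/3 for 3 replicas).
    static double storageEfficiency(int replicas) {
        return 1.0 / replicas;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long fileBytes = 1024 * mib;   // a 1 GiB file (example value)
        long blockBytes = 128 * mib;   // assumed block size
        int replicas = 3;              // replica count from the slide

        long blocks = blockCount(fileBytes, blockBytes);
        System.out.println("blocks: " + blocks);
        System.out.println("block replicas on the cluster: " + blocks * replicas);
        System.out.println("raw MiB used: " + rawBytes(fileBytes, replicas) / mib);
        System.out.println("storage efficiency: " + storageEfficiency(replicas));
    }
}
```

With these assumed numbers, a 1 GiB file becomes 8 blocks, stored as 24 block replicas that occupy 3 GiB of raw disk, which is where the one-third efficiency figure comes from.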
Editor's notes

  1. Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
  2. Hadoop masks being a distributed system: what it means for Hadoop to abstract away the details of distributed systems and why that’s a good thing
  3. Importance: get more done faster; barrier of entry. Downsides: knowing what you are doing; abstraction bleeding through.
  4. Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
  5. Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
  6. Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
  7. Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
  8. Hadoop scales out linearly: why Hadoop’s linear scalability is a paradigm shift (but one with a few downsides)
  9. Importance: code stays the same as your cluster and problem grow; massively scalable. Downsides: need to do things in a linear way; it’s not always true.
  10. Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
  11. Hadoop runs on commodity hardware: an honest definition of commodity hardware and why this is a good thing for enterprises
  12. Importance: ease of accessibility; cloud. Downsides: sometimes have a hard time leveraging fancier hardware.
  13. Hadoop handles unstructured data: why Hadoop is better for unstructured data than other data systems from a storage and computation perspective
  14. Importance: unstructured data. Downsides: cost of flexibility.
  15. In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
  16. In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
  17. In Hadoop, you load data first and ask questions later: the differences between schema-on-read and schema-on-write and the drawbacks this represents
  18. Importance: solves the chicken-and-egg problem. Downsides: cost of flexibility.
  19. HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
  20. HDFS stores the data but has some major limitations: an overview of HDFS (replication, not being able to edit files, and the NameNode)
  21. Importance: scalable data storage that works for analytics. Downsides: it’s bad storage.
  22. YARN controls everything going on and is mostly behind the scenes: an overview of YARN and the pitfalls of sharing resources in a distributed environment and the capacity scheduler
  23. Importance: YARN brings people closer to a universal distributed system without getting in the way (same path). Downsides: cost of abstraction and system complication.
  24. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  25. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  26. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  27. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  28. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  29. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  30. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  31. MapReduce may be getting a bad rap, but it’s still really important: an overview of MapReduce (what it’s good at and bad at and why, while it isn’t used as much these days, it still plays an important role)
  32. Importance: MapReduce offers fault tolerance, long-running jobs, and reliability; other parts of the ecosystem work together to solve a problem. Downside: lack of a universal interface (Spark? Holy grail?).
  33. Hadoop is open source: what it really means for Hadoop to be open source from a practical perspective, not just a “feel good” perspective
  34. Importance: organic growth; competition. Downsides: ????
  35. The Hadoop ecosystem is constantly growing and evolving: an overview of current tools such as Spark and Kafka and a glimpse of some things on the horizon
  36. Importance: innovation; organization. Downside: fractured; hard to track; lack of cohesiveness.
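Notes 15 to 17 contrast schema-on-read with schema-on-write, and slide 19 puts it as "data is parsed/interpreted as it is loaded out of HDFS". A small sketch may make the idea concrete. It is plain Java with no Hadoop involved; the class `SchemaOnReadSketch`, the sample records, and the comma-separated layout are all made up for illustration. The raw lines are kept exactly as they arrived, and each question applies its own interpretation only when it reads them.

```java
import java.util.List;

// Hypothetical illustration of schema-on-read; the records below are invented.
class SchemaOnReadSketch {

    // Raw lines stored as-is: nothing was validated or reshaped at load time.
    static final List<String> RAW = List.of(
            "2016-03-31,san-jose,41",
            "2016-04-01,new-york,17",
            "malformed line");

    // Question 1: total of the third field. The "schema" (three comma-separated
    // fields, the last one an integer) is applied here, at read time. Lines that
    // do not fit this reading are skipped, which is part of the cost of flexibility.
    static int sumThirdField(List<String> rawLines) {
        int total = 0;
        for (String line : rawLines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue;
            }
            try {
                total += Integer.parseInt(fields[2].trim());
            } catch (NumberFormatException e) {
                // Not a number under this interpretation; skip the record.
            }
        }
        return total;
    }

    // Question 2, asked later over the same raw lines with a different reading:
    // how many lines start with a given date prefix?
    static long countWithPrefix(List<String> rawLines, String prefix) {
        return rawLines.stream().filter(line -> line.startsWith(prefix)).count();
    }

    public static void main(String[] args) {
        System.out.println("sum of third field: " + sumThirdField(RAW));
        System.out.println("lines from 2016-03: " + countWithPrefix(RAW, "2016-03"));
    }
}
```

The malformed line loads without complaint and only surfaces when a query tries to interpret it, which is one way to read the "cost of flexibility" downside in the notes.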