2. BEFORE WE BEGIN
Questions for the audience…
How many of you have:
Been working with Hadoop for more than 3 months?
Been working with Hadoop for more than 6 months?
Been working with Hadoop for more than 1 year?
How many of you have heard about this thing called ‘Hadoop’ / ‘Big Data’ and thought it would be fun to check it out?
3. About the Speaker
BSCIS - The College of Engineering, The Ohio State University
‘Big Data’ Consultant with > 25 years in IT
Working solely in the ‘Big Data’ space since 2009
Founded Chicago area Hadoop User Group (CHUG) in April 2010
1600+ Members
Over 200 different companies across all industries in the Chicagoland area.
Has routinely spoken at conferences around the US on Hadoop.
Guest lecturer at Illinois Institute of Technology.
Co-authored papers found on InfoQ.
MapR Admin, Cloudera Admin & Developer Certified.
email: MSegel (at) segel.com
Skype: Michael_Segel
4. What is Hadoop?
‘A Framework of software tools to allow one to take a large problem and process individual pieces in parallel.’
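A minimal sketch of "take a large problem and process individual pieces in parallel". Threads on one machine stand in for cluster nodes here, and the names count_words and parallel_word_count are hypothetical, made up for this illustration. This is not Hadoop code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Process one piece of the problem: count the words in a list of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, pieces=4):
    """Split the input into pieces, process them in parallel, merge the results."""
    size = max(1, -(-len(lines) // pieces))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total

lines = ["big data is big", "hadoop handles big data", "data data data"]
result = parallel_word_count(lines, pieces=3)
print(result.most_common(2))
```

Each piece is counted independently, so adding workers does not change the answer, only how the work is spread out.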
5. Our Hadoop Layer Cake: Circa 2010
[Layer cake diagram: Storage, Job Control, Data Access, Programming Languages]
6. Our Hadoop Layer Cake: Circa 2013 Hadoop 2.0
[Layer cake diagram: Data Access, Storage, Job Control, Resource Control, Real Time Messages, Data Frameworks]
Confused?
This is just the tip of the iceberg.
7. The only constant is change…
Hadoop is a disruptive technology, forcing the enterprise to rethink how it handles data.
The core Apache Framework is just the starting point.
Disruption allows new vendors to compete with established vendors.
If you can build a better mousetrap, you will attract customers.
Hadoop plays nice with others…
8. PROPRIETARY SOFTWARE IS BAD.
“Qu’ils mangent de la brioche.” (‘Let them eat cake’)
Myth: PROPRIETARY SOFTWARE IS BAD.
Reality: VENDOR LOCK IN IS BAD.
9. HADOOP IS ONLY GOOD FOR BATCH PROCESSING
“Qu’ils mangent de la brioche.” (‘Let them eat cake’)
Myth: HADOOP IS ONLY GOOD FOR BATCH PROCESSING.
Reality: HADOOP CAN ALSO BE USED FOR ‘REAL TIME’ PROBLEMS.
10. REAL TIME HADOOP
SINGLE DATA CENTER SOLUTION
[CENSORED] (project / date / client title block)
Nightly batch jobs create the next day’s advertising lists.
Client phone connects to the web service.
Web service talks to Ad Engine.
Phone connects to Ad Engine to get Ad.
Ad Engine connects to HBase to get list of potential Ads to display, sending the correct Ad to phone.
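The split between the nightly batch and the real-time lookup can be sketched as below. Everything here is an assumption for illustration: a plain dict stands in for the HBase table, the function names are hypothetical, and the rule for picking the "correct" ad (first candidate not yet shown) is invented, since the slide does not say how the ad is chosen.

```python
# Stand-in for an HBase table: row key (user id) -> list of candidate ads.
# A real deployment would use an HBase client; this dict is only an illustration.
ad_table = {}

def nightly_batch(user_interests, ads_by_interest):
    """Batch step: build the next day's candidate ad list for every user."""
    for user_id, interests in user_interests.items():
        candidates = []
        for interest in interests:
            candidates.extend(ads_by_interest.get(interest, []))
        ad_table[user_id] = candidates

def ad_engine_lookup(user_id, already_shown):
    """Real-time step: one keyed read, then pick the first ad not yet shown.

    The selection rule is an assumption made for this sketch.
    """
    for ad in ad_table.get(user_id, []):
        if ad not in already_shown:
            return ad
    return None

nightly_batch(
    {"u1": ["sports", "travel"]},
    {"sports": ["ad-shoes"], "travel": ["ad-flights"]},
)
chosen = ad_engine_lookup("u1", already_shown={"ad-shoes"})
print(chosen)
```

The heavy work happens once per night; the request path is a single lookup by key, which is what makes the ‘real time’ part feasible.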
11. HADOOP IS A STAND ALONE SYSTEM AND WILL REPLACE TRADITIONAL VENDORS’ PRODUCTS
“Qu’ils mangent de la brioche.” (‘Let them eat cake’)
Myth: HADOOP IS A STAND ALONE SYSTEM AND WILL REPLACE TRADITIONAL VENDORS’ PRODUCTS.
Reality: HADOOP IS PART OF THE ENTERPRISE. IT CAN BE STANDALONE, OR IT CAN WORK WITH EXISTING INFRASTRUCTURE.
12. HADOOP AND THE ENTERPRISE
WE CAN ALL GET ALONG…
[Title block: PROJECT, DATE, CLIENT, TODAY]
Hadoop communicates well with the rest of the Enterprise…
Central cluster feeds distributed web services with local database backing…
[split into two slides]
13. HADOOP AND THE ENTERPRISE
WE CAN ALL GET ALONG…
[Title block: PROJECT, DATE, CLIENT, TODAY]
Hadoop communicates well with the rest of the Enterprise…
Traditional Data Stores play nice with Hadoop. Some see HDFS files as external tables.
[split into two slides]
14. How Traditional Vendors view Hadoop
In the beginning they saw Hadoop as a threat.
They would crush it.
If you can’t beat them, join them….
Oracle partners with Cloudera.
EMC partnered with MapR, then released its own distribution. (Green Stack)
Teradata partners with Hortonworks.
Microsoft partnered with Hortonworks.
Intel
Tried to create their own distro.
Last week, dumped their distro and made a large investment in Cloudera.
IBM … Has its own distro, yet certifies its tools to run on Cloudera.
Cisco partners with MapR.
Amazon (AWS) has its own distro and partners with MapR.
15. HADOOP CLUSTERS SHOULD BE BUILT ON COMMODITY HARDWARE.
“Qu’ils mangent de la brioche.” (‘Let them eat cake’)
Myth: HADOOP CLUSTERS SHOULD BE BUILT ON COMMODITY HARDWARE.
Reality: YOU CAN DESIGN YOUR CLUSTER AROUND CONSTRAINTS…
17. HADOOP IS OPEN SOURCE AND THEREFORE FREE.
“Qu’ils mangent de la brioche.” (‘Let them eat cake’)
Myth: HADOOP IS OPEN SOURCE AND THEREFORE FREE.
Reality: T.A.N.S.T.A.A.F.L ‘TANS - TAH - FELL’
(THERE AIN’T NO SUCH THING AS A FREE LUNCH)
18. There ain’t no such thing as a free lunch…
Customers are paying for support.
Tools are primitive and require work. There is no real point-and-click solution in place, but it is getting there.
Hadoop fills the gap where you want a custom solution.
Merging semi-structured and structured data is going to be data dependent, requiring customization.
Beyond ETL and SQL, custom apps require developer expertise. (You must invest in skills.)
Depending on Use Case, Time to Value (TtV) will differ.
Bottom line: there is a cost reduction over traditional solutions, but it’s not free.
19. Take away…
Hadoop is a tool set that is constantly evolving.
Beware of marketing myths…
Do your own homework and talk to the vendors.
Make them earn your business.
T.A.N.S.T.A.A.F.L applies: you need to make an investment in terms of skills.
Hadoop isn’t a separate solution and should be part of your overall Enterprise strategy.
Hadoop isn’t a silver bullet. By itself, it doesn’t solve your business problems.
22. What is a layer cake?
layer cake
noun [C] US
: two or more soft cakes put on top of each other with jam, cream, icing, etc. (= a sweet mixture made from sugar) between the cakes and covering the top and sides
: a term for a diagram showing how various parts of a group of components tie together in terms of a functional stack.
23. What is Hadoop?
Storage Layer
The Storage Layer is a Distributed File System that accomplishes the following:
Uniform Access from any machine in the cluster.
Fast Access (
Resiliency (Self Healing)
Redundancy (Replication)
This is known as HDFS - Hadoop Distributed File System
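Redundancy and self-healing can be illustrated with a toy model. This is a sketch under stated assumptions, not how HDFS actually places blocks: the replication factor of 3, the round-robin placement, and the names place_blocks and heal are all chosen for this example.

```python
import itertools

REPLICATION = 3  # assumed replication factor for this sketch

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), so consecutive picks never repeat.
    """
    placement = {}
    ring = itertools.cycle(nodes)
    for block in blocks:
        placement[block] = {next(ring) for _ in range(replication)}
    return placement

def heal(placement, live_nodes, replication=REPLICATION):
    """After a node failure, re-replicate blocks that fell below the target."""
    for block, holders in placement.items():
        holders &= set(live_nodes)  # drop replicas that lived on dead nodes
        spares = [n for n in live_nodes if n not in holders]
        while len(holders) < replication and spares:
            holders.add(spares.pop(0))
    return placement

nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b1", "b2"], nodes)
heal(placement, live_nodes=["n1", "n3", "n4"])  # n2 has failed
print(placement)
```

Because every block already had copies on surviving nodes, losing n2 costs no data; the heal step only restores the replica count.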
24. What is Hadoop?
Job Control Layer
The Job Control Layer is the layer that accomplishes the following:
Manages and Schedules Jobs to be run. (Default [FIFO],
Capacity Scheduler,
Manages the overall job, and distributes the subprocesses across the cluster.
Manages the subprocesses being run on each node in the cluster.
This is accomplished by a Job Tracker (Cluster level) and Task Tracker (Node Level)
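A rough sketch of FIFO scheduling with one cluster-level tracker handing tasks to nodes. The round-robin assignment and the name run_fifo are assumptions for illustration; a real scheduler also weighs things like data locality and free slots, which this ignores.

```python
from collections import deque

def run_fifo(jobs, nodes):
    """Cluster-level tracker: take jobs in arrival order and hand tasks to nodes.

    `jobs` is a list of (job_name, task_count) pairs.
    Returns a dict mapping each node to the tasks it was given.
    """
    queue = deque(jobs)  # FIFO: first submitted, first scheduled
    assignments = {node: [] for node in nodes}
    slot = 0
    while queue:
        job_name, task_count = queue.popleft()
        for task_id in range(task_count):
            node = nodes[slot % len(nodes)]  # node that will run this task
            assignments[node].append(f"{job_name}-task{task_id}")
            slot += 1
    return assignments

schedule = run_fifo([("etl", 3), ("report", 2)], ["node1", "node2"])
print(schedule)
```

Note that no task of "report" is handed out until every task of "etl" has been assigned, which is the defining trait of FIFO.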
25. What is Hadoop?
Data Access Layer
The Data Access Layer is the layer that accomplishes the following:
Allows for a higher level access which can be translated to a Map/Reduce Job
Pig (Yahoo!)
Hive (Facebook)
Allows for ad hoc access to data outside of the Map/Reduce Framework (HBase)
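To make "higher level access translated to a Map/Reduce Job" concrete, here is a hand-written sketch of a GROUP BY count expressed as map, shuffle, and reduce steps. It is not what Pig or Hive actually generate; the function names are hypothetical and everything runs in one process.

```python
from collections import defaultdict

def map_phase(records, key_field):
    """Map: emit (key, 1) for every record."""
    for record in records:
        yield record[key_field], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into one count."""
    return {key: sum(values) for key, values in groups.items()}

def group_by_count(records, key_field):
    """Rough equivalent of: SELECT key_field, COUNT(*) ... GROUP BY key_field."""
    return reduce_phase(shuffle(map_phase(records, key_field)))

logs = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
page_counts = group_by_count(logs, "page")
print(page_counts)
```

The point of the higher level tools is that the user writes the one-line query and the three functions are produced for them.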
26. What is Hadoop?
Job Flow Control Layer
The Data Access Layer is the layer that accomplishes the following:
Allows for a higher level access which can be translated to a Map/Reduce Job
Pig (Yahoo!)
Hive (Facebook)
Allows for ad hoc access to data outside of the Map/Reduce Framework (HBase)
Allows for processes to be chained together to create a workflow (Oozie)*
*Nowhere else to put it…
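Chaining processes into a workflow can be sketched as a list of steps run in order, each feeding the next. This is only the idea, not Oozie itself: Oozie workflows are defined in its own format, and the run_workflow function and status labels below are made up for this illustration.

```python
def run_workflow(steps, data):
    """Run named steps in order, feeding each step's output to the next.

    Stops at the first failure and reports which step failed.
    """
    completed = []
    for name, action in steps:
        try:
            data = action(data)
        except Exception as error:
            return {"status": "KILLED", "failed_step": name,
                    "completed": completed, "error": str(error)}
        completed.append(name)
    return {"status": "SUCCEEDED", "completed": completed, "result": data}

steps = [
    ("clean", lambda rows: [r.strip() for r in rows]),
    ("dedupe", lambda rows: sorted(set(rows))),
    ("count", len),
]
outcome = run_workflow(steps, [" a", "b ", "a"])
print(outcome)
```

Recording which steps completed is what lets an operator see where a failed chain stopped instead of rerunning everything blind.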
27. List of Apache Incubator Projects associated with Hadoop:
Storm
Accumulo
Knox
Sentry
Falcon
DataFu
Drill
Tez
Twill
Phoenix
Hadoop Dev Tools
Tajo