Hadoop World 2011: Practical Knowledge for Your First Hadoop Project
Boris Lublinsky / NAVTEQ
     Mark Slusar / NAVTEQ
Mike Segel / Segel & Assoc.
Boris Lublinsky
  • 25+ yrs experience as an enterprise architect with a focus on
     end-to-end solutions, distributed systems, SOA, BPM, etc.
  • InfoQ SOA editor, OASIS member, writer, speaker
  • FermiLab, SSA, Platinum, CNA, NAVTEQ, et al
Mike Segel
  • 20+ yrs experience in the IT industry with a focus on high-
     powered computing, information management, and philosophy
  • Founder of Chicago Hadoop User Group (an excuse to drink
     beer and eat pizza :) )
  • Clients include NAVTEQ, Orc, IBM, Informix, Montgomery
     Ward, CCC, and others…
Mark Slusar
  • 15 yrs experience with a background of design, technology,
     and leadership
  • Sponsor of Chicago Hadoop User Group
  • Federal Reserve, NEC, United Airlines, NAVTEQ, et al
   This presentation is based on our 2+ years of experience
    onboarding Hadoop projects

   Part 1: Tactics to 'sell' Hadoop to Stakeholders and Senior
    Management
    • Understanding what Hadoop is
    • Alignment of goals
    • Picking a project
    • Level Setting Expectations

   Part 2: Running a Successful Development Project
    • Training
    • Preparation & Planning Activities
    • Development & Test Activities
    • Deployment & Operations Activities
•   Define the problem:
    • Understand the company's pain(s)
    • Find the right problem to solve
      • Low Hanging Fruit
      • High Value
      • High Visibility
    • Don't bet the Farm.
    • Create a problem statement
    Sell the Solution(s), not a Technology
    • Selling is an educational process
    • Understand that Hadoop is a tool, not a panacea or 'cure-all'
Hadoop
 Large data storage
 Bringing execution to the data
 Structured and unstructured data
 Massively parallel processing
 Extensible ecosystem

Not Hadoop
 Real-time data processing (difficult)
 Data set is not large enough
 Processing algorithm not compatible with M/R
 Existing processes are well suited to solve the problem
 ACID requirements (transaction-based)
 One person doing a million things vs. one million people doing one thing
 Set realistic goals
 Set boundaries
 Avoid scope creep
 Embrace what you don't know:
   • Make an honest evaluation of your and your team's skills
   • Hadoop is a paradigm shift; therefore you need to alter your
     approach to solving the problem.
 Level Set Expectations:
   • The technology is new to the organization
   • There is a learning curve
   • TANSTAAFL (There ain't no such thing as a free lunch.)
   • Think for yourself: take Hadoop urban legends with a grain of
     salt
   The sales process takes time.
   Selling is an educational process
     For you:
      • Learn the stakeholders' pain
      • Determine the scope of the problem
      • Formulate your own estimates
     For your stakeholders:
      • They must 'buy in' to your solution.
      • Appreciate the underlying technology
      • Understand the risks

   Don't oversell, and don't underestimate
 Reaffirm the stated pains and any identified latent
  pain(s).
 Give your audience time to digest the presented
  information.
 Show how the solution solves their problem

 Avoid going straight to 'The Bottom Line'

 Understand common objections and overcome
  them.
     • “…We can do this in an RDBMS…”
     • “…This sounds risky…”
     • “…Who else is doing this?…”
     • “…Who's using it in production?…”
     • “…Sounds expensive…”

 Talking points are included at the end of the slide presentation
   Executive Sponsorship – Identify the key players and understand
    their 'pains'.

   Project is Sufficiently Funded

   Project Charter – The project is well defined with set goals and
    expectations.

   Level Set Expectations: The technology is new to your company,
    and it should be expected that you will face setbacks during the
    project. (Lower the expectations to a point where you know you
    can exceed them.)

   Outside Expertise. (Buy/Build/Blended Model)
   Resources have been identified and have been dedicated to this
    project.

     Business Analyst Support – a good understanding of data
      and access patterns is essential.

     Architecture – Hadoop is a paradigm shift. It is essential to reflect
      it in the solution architecture. Integration with existing enterprise
      applications can present additional challenges.

     Developers – Candidates (Java/Unix Proficiency with a myriad of
      data-driven projects under their belt). Ability and desire to learn
      new tricks.

     Infrastructure Support – have Hadoop administrators who are
      experienced and/or capable of learning.

   Training – Not just APIs, but also Hadoop concepts and patterns.
   Hadoop is an unregistered trademark of Apache
   Several companies provide commercial
    support for Hadoop and Hadoop derivatives.
      • Cloudera
      • MapRTech
      • HortonWorks
      • Others (HStreaming, DataStax, …)
   And there is also Amazon…
   Application - Walk through the business process and create a simple,
    plain-English outline of what you want to achieve in each step.

   Hardware - Determine your initial data set(s) and design your
    cluster accordingly.

   Design & Development are iterative processes.
     Your first iteration is rarely your last iteration.
     Don't be embarrassed by your code. Share it with others for feedback
      and improvement.

   KISS, KISS, KISS

   Data storage - Which to use: HDFS or HBase?
HDFS
 Use HDFS when you are always going to access your data as an
  entire set or a very large subset.
 HDFS access is sequential and read-only; HDFS supports only
  create and append.
 HDFS is mainly used in Map/Reduce. Direct access from the client
  is possible, but typically requires indexing. It provides language
  (Java) APIs only.
 When using HDFS you always want large (GB) size files. Packaging
  smaller files into larger ones requires development effort.

HBase
 Use HBase when you want random access to your data set: individual
  records, partial records, and subsets of records. HBase provides
  more control over partitioning data.
 HBase supports get, put, update, and scan of sequential keys.
 HBase can be accessed from either a Map/Reduce program or directly
  from a client. It supports Java, REST, and Thrift APIs.
 HBase provides built-in versioning and purging of data.
 Many new enhancements are coming; coprocessors are the most
  significant.
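To make the contrast concrete, below is a minimal access sketch, assuming the standard Hadoop FileSystem API and a 2011-era (0.90.x) HBase client; the path, table, and column names are illustrative assumptions, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StorageAccessSketch {
  public static void main(String[] args) throws Exception {
    // HDFS: sequential, create/append-only access to large files.
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/data/events/part-00000"));
    out.writeBytes("record-1\n"); // write-once; no in-place updates
    out.close();

    // HBase: random reads/writes of individual records by row key.
    HTable table = new HTable(HBaseConfiguration.create(), "events");
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("v1"));
    table.put(put);                                           // random write
    Result row = table.get(new Get(Bytes.toBytes("row-42"))); // random read
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"))));
    table.close();
  }
}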
    Automate your Environment Setup
     Use Puppet, Chef, Cloudera Enterprise Manager, etc…
     Rely on Hadoop Ecosystem whenever possible.

    Configuration
    • See Mike Guenther's Lecture (CHUG Archive)
    • Use Cloudera Docs
    • Configuration is a continuous process
    • Tune the cluster and the application independently.
    • Don't optimize your cluster for your application; optimize your
       application for your cluster.
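As a small illustration of that last point, here is a sketch of application-level tuning in a 0.20-era driver; the property names are real for that vintage of Hadoop, but the values are illustrative assumptions, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Cluster-wide defaults stay in core-site.xml / mapred-site.xml;
// the application overrides only what this particular job needs.
public class JobTuning {
  public static Job newTunedJob() throws Exception {
    Configuration conf = new Configuration(); // loads the cluster's *-site.xml
    conf.setBoolean("mapred.compress.map.output", true); // shrink shuffle I/O
    conf.setInt("io.sort.mb", 256);            // map-side sort buffer (MB)
    Job job = new Job(conf, "tuned-job");
    job.setNumReduceTasks(16); // sized to the cluster, not the other way around
    return job;
  }
}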

   Plan your Development Iterations
    • Data storage Model
    • ETL (loading data in/out of Hadoop)
    • Automate Environment Setup
    • Processing
    • Integration (interacting with other enterprise applications)
    • Reporting interface & diagnostics to show speed and utilization
   Understand the MapReduce model and patterns – read Jimmy Lin
    and Chris Dyer's book Data-Intensive Text Processing with
    MapReduce.
   See if you really need reducers (they are expensive) and, if you
    do, try to use combiners (see the word-count sketch after this list)
   Use a custom InputFormat if you need better control over the
    execution of your Maps
   Programmatic writes to predetermined files might lead to
    unpredictable results.
   Use Oozie for orchestrating multiple MapReduce jobs.
   Use Oozie for automatically starting your jobs when data arrives
    (a client-submission sketch also follows this list)
   Don't be afraid to ask for help.
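To ground the reducer/combiner advice, here is a minimal word-count sketch in the 0.20-era API; reusing the reducer as a combiner is safe here only because summing is associative and commutative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // business logic only; the framework
      }                           // handles partitioning, sort, and shuffle
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregates map output
                                            // locally to shrink the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}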
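And a sketch of the Oozie point, assuming the Oozie Java client and a workflow already deployed to HDFS; the URL and paths are illustrative assumptions. (For data-triggered starts, the workflow would be wrapped in an Oozie coordinator with a dataset dependency.)

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

// Submitting a predeployed Oozie workflow from Java.
public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH,
        "hdfs://namenode:8020/apps/wordcount-wf"); // dir holding workflow.xml
    props.setProperty("nameNode", "hdfs://namenode:8020");
    props.setProperty("jobTracker", "jobtracker:8021");
    String jobId = oozie.run(props); // submit and start the workflow
    System.out.println("Started workflow: " + jobId);
  }
}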
   Be prepared to refactor your code many times. You often start
    wrong, but your goal is to end right.

   Tom White's Hadoop book (Hadoop: The Definitive Guide)

   Lars George's HBase book (HBase: The Definitive Guide)

   In addition to MapReduce, investigate additional Hadoop
    technologies (Pig, Hive, Flume, et al)

   Be prescriptive; use only the technology you really need

   Don't forget about the community; it will be extremely
    helpful. See http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/
    [Shameless plug. :) ]
   Unit Test the Application and the Interface (see the MRUnit
    sketch after this list)
   Test Hadoop – report issues to Cloudera.
   Opening Support Tickets* – a life saver for new teams. (Cloudera
    offers support contracts)
   Optimize your application, not the cluster
   End to End Testing – it matters, it ensures confidence
   Performance testing – it's one of the drivers of the project.
     Make sure you test on realistic data volumes – results can be
      deceiving on smaller data sets.
     Showcase the ability of the cluster compared to existing
      systems
   Consulting – have consultants look over your application, but do not
    outsource the implementation to them. Make sure you build internal
    knowledge.

                                       *Assumes that you have a corporate license…
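A sketch of the unit-testing point, assuming Apache MRUnit and the word-count mapper sketched earlier; MRUnit exercises a mapper or reducer in isolation, with no cluster required:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void mapperEmitsOneCountPerToken() throws Exception {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordCount.WordCountMapper()) // from the earlier sketch
        .withInput(new LongWritable(0), new Text("hadoop hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .runTest(); // verifies outputs without spinning up a cluster
  }
}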
 SLAs   – Not advisable for Hadoop Project #1
 Involve   Deployment & Operations personnel from the get-go;
    they will be supporting it
   Operations Team:
    • Hadoop Administration Training
    • Data Analysts & Users trained and involved with the process
     as stakeholders
    Data Maintenance – The role of the DBA begins to
    change; existing DBAs should take an interest in Hadoop
   Playbooks – should help address many Hadoop-related issues
    without involving developers & architects
   UATs – use as needed and depending on methodology
   What worked well in the first project?

   What did not work?

   Ready to process Mission Critical Data?

   Begin to establish SLAs?

   Consider real-time data delivery?

   Ready to support enterprise data?
http://hadoop.apache.org/ (Apache Hadoop)
http://www.cloudera.com/ (Cloudera)
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ (CHUG)


Or find Mike, Boris, or Mark on LinkedIn
Appendix
•   Scalability – A large data problem can be broken into many pieces
    processed in parallel by 10, 100, or 1,000 machines, all
    working toward a common goal. Adding more machines improves
    scalability.

•   Incredible Performance – Hadoop holds the performance record
    for data processing (terabyte sort in 209 seconds – Yahoo!)

•   Data integrity – Data is stored multiple times across nodes.

•   Separation of concerns – developers need to write only business
    code – mappers and reducers. All infrastructure “heavy lifting” &
    job management is done by the framework.
•   Yahoo – Content Optimization, Sorting, Ad Placement

•   Facebook – Largest Hadoop Cluster, Terabytes of insights
    processed per DAY. Social email.

•   LinkedIn – Computationally Intensive operations for Enterprise
    Data: “People You May Know”, “Viewers of this Profile Also
    Viewed”, “Job Recommendations”

•   Groupon – Analytics and Data mining on “Extreme Data”

•   Nokia – See http://www.cloudera.com/videos/apache-hadoop-nokia-josh-devins

•   For more companies see:
    http://wiki.apache.org/hadoop/PoweredBy
•   Massive data storage – ability to correlate seemingly disparate
    data. Ability to store lots of historical data.

•   Computational Power – Ability to run reports and ask questions
    that could not previously be asked – asking “golden questions”

•   Throughput – shorter time to complete jobs allows even more
    “golden questions”

•   “Golden questions” – change the game, drive profits, and
    positively disrupt businesses
•   Commodity Resources - Nodes cost as much as a workstation.
    No specialized hardware.

   Expenditures - No software purchases, no negotiations with
    vendors, no licensing headaches – free downloads. (For initial
    PoC installation.)

•   Easily proved - A Proof of Concept can be executed in a
    virtualized environment or in a public cloud.

Editor's notes

  1. This is our obligatory slide that tells you who we are and that some of us are really old and have been doing this for far too long. Everyone does their own, so that the audience knows who we are. Maybe have everyone introduce themselves, but I really don't want to pimp myself. [Mikey]
  2. [Mikey] I want to preface this slide by stating that the ideal audience for this presentation is someone who's just starting to investigate Hadoop and wants to introduce it to their organization. If you've already started implementing a project, please pay attention to Part 2, where we discuss ways to increase your project's chance of success. Also, any feedback on your Hadoop selling experience will be valuable for the authors.
  3. [Mikey] Step one: Setting your goals. The first thing one needs to do is to identify what problems you want to solve. Create a 'short list' of the problems, and determine which problem is the best candidate. Look for a problem that can be solved in an M/R environment. Look for a problem where you're not 'betting the farm', one where, if you fail to deliver a solution on time and on budget, you're not going to condemn Hadoop as an option for future projects. Create a problem statement which in plain English identifies the problem you are attempting to solve and some 'boxing' constraints which limit the scope of the problem. Once you have identified the problem you want to solve, you need to sell the problem and solution to your stakeholders. In selling the solution you want to focus on the solution itself and not the underlying technology. In this case, we are talking about Hadoop. Sure, Hadoop is sexy and everyone wants to learn it… to pad their resume. But your stakeholders don't care about the technology, just that you have a potential solution which solves their problem and is cost effective. While we are here because we like Hadoop, and use Hadoop, remember that Hadoop is just a tool; it's not a 'cure all' and perfect for every problem. If you're at the stage that you know you want to use Hadoop, but you don't know what sort of problems you need to solve, it's time to identify the potential stakeholders, those who
  4. [Mikey] This leads us to our next point. When do we want to use Hadoop? What sort of problems do we think will be a good fit for Hadoop, and what problems do we think would be better solved using a different tool? These are all questions that we have to think about before settling on a tool. [Boris will walk through slide]
  5. [Mikey] Part of the selling process is to first realize what you want to sell to 'management'. You first have to set your goals and know what you want to gain from the project. (Besides learning how to work with a really cool tool and pad your resume…) If you do not yet have a problem to solve, you may want to do some research and talk with your stakeholders. So… we set realistic goals… like processing X records per hour or Y incoming files... some metric that you know you can beat and should really be obtainable. Once you have the project and the goals, you need to set boundaries. Like processing a specific stream of data only. Or only handling CSV files and not XML input files. Once you've set your boundaries, if at all possible, you want to avoid scope creep, noting that you can always add to the project after you get it working. Lock down the requirements at the start of the project. [Talk through points on slide…]
  6. [Mikey] There is a psychology to the selling process, at a high level. Even if you've done your homework and know the answer before presenting the solution, if you provide the answer too quickly, your stakeholders will suspect you and your solution. You have to listen to your stakeholders, learn their 'pain', and determine the scope of the problem and what the constraints to the problem are. (Proposing a million-dollar solution when there's only $100,000 in the budget doesn't help.) By listening to the stakeholders, you are showing them that you are crafting your solution to meet their needs, and when you present your solution you can address and re-affirm their pain. Once you have a rough idea, get estimates rather than relying on a SWAG. It's OK to say you don't know something; make an action item and take the time to get the right answer. The stakeholders need to buy in to your solution and take ownership of it. They need to appreciate the underlying technology, its challenges and risks.
  7. [Mikey] When presenting the solution to the stakeholders, you need to have your ducks in a row. You need to re-affirm their stated pains, along with any latent pains you find while talking with the stakeholder's team(s). This not only shows that you are presenting a solution but that your solution addresses their needs, and it starts the process of taking ownership of the solution. During the presentation you want to avoid 'cutting to the chase' or going straight to the bottom line of saying that it costs X dollars. By going straight to the bottom line, you don't give the stakeholders and project leads time to digest the solution and to take ownership of the problem. Stakeholders typically don't care about the underlying technology. They are more interested in finding a cost-effective solution that solves their problem and can be modified as the business or environment changes. This is not to say that explaining the technology in the solution isn't important, but your 'sales' success is going to be based on how well you meet their criteria for success. Does the solution meet their needs? What's the time to value? Relative to other possible solutions, is this cost effective? There are some objections that you can't overcome. In these situations, Hadoop isn't a good fit and you should move on to the next potential problem to solve.
  8. Mark
  9. Boris + Mark
  10. Hadoop is an unregistered trademark of Apache and is meant to refer to Apache's release only. Any release which is not the official release from Apache would be a derivative work. Cloudera offers a derivative work which is free and also has commercial support. MapRTech has a derivative work that replaces the underlying HDFS with their own proprietary use of C++, writing directly to the raw disk. Mikey, Boris – Amazon
  11. The KISS principle has been around for ages. Regardless of your design methodology, you want to start off simple and build out. This allows one to learn the technology and work through the design challenges. When working through the software design, start by creating a simple English description of how you want to process the data and what you want to achieve in each step. This is useful when you need to go back to the SME/Business Analyst, who may not be familiar with UML or a class diagram. (It's also a document that you can use to verify the other diagrams.) Boris, Mike – hardware
  12. Boris
  13. [mike] I am not sure what you want to say with this slide, please add speaker notes! [mark] One of your first tasks is setting your environment up. Whether you go virtual or physical for your first project, you will need to refer to documentation; do not stray too far from the default configuration until you are comfortable or advised to do so. Use a tool like Puppet or Chef for configuration management. Additional configuration tips can be found in Mike Guenther's presentation and at docs.cloudera.com. As you develop features, you will need to address your data model. You will also be writing code to ingest data, process it, and display it. Keep in mind that these features will be part of your report on how you succeeded with Hadoop. This is a multi-level slide: 1. You need a reproducible environment; you can't afford to rely on manual tweaking every time you have to reinstall. 2. Without proper configuration your application will not work. Configuration is a two-level process: optimize your cluster to run any job well, then optimize your job for the cluster that you are using. Give an example of separating HBase configuration from your table configuration. 3. Describe design steps. Mark?
  14. Boris.
  15. Boris, Mike community
  16. Mark
  17. Mark
  18. Mark?
  19. All