Hadoop World 2011: Practical Knowledge for Your First Hadoop Project
Boris Lublinsky / NAVTEQ
     Mark Slusar / NAVTEQ
Mike Segel / Segel & Assoc.
Boris Lublinsky
  • 25+ yrs experience as an enterprise architect with a focus on
     end-to-end solutions, distributed systems, SOA, BPM, etc.
  • InfoQ SOA editor, OASIS member, writer, speaker
  • FermiLab, SSA, Platinum, CNA, NAVTEQ, et al
Mike Segel
  • 20+ yrs experience in the IT industry with a focus on high-
     powered computing, information management, and philosophy
  • Founder of Chicago Hadoop User Group (an excuse to drink
     beer and eat pizza :) )
  • Clients include NAVTEQ, Orc, IBM, Informix, Montgomery
     Ward, CCC, and others…
Mark Slusar
  • 15 yrs experience with a background of design, technology,
     and leadership
  • Sponsor of Chicago Hadoop User Group
  • Federal Reserve, NEC, United Airlines, NAVTEQ, et al
   This presentation is based on our 2+ years of experience
    onboarding Hadoop projects

   Part 1: Tactics to 'sell' Hadoop to Stakeholders and Senior
    Management
    • Understanding what Hadoop is
    • Alignment of goals
    • Picking a project
    • Level Setting Expectations

   Part 2: Running a Successful Development Project
    • Training
    • Preparation & Planning Activities
    • Development & Test Activities
    • Deployment & Operations Activities
•   Define the problem:
    • Understand the company's pain(s)
    • Find the right problem to solve
      • Low Hanging Fruit
      • High Value
      • High Visibility
    • Don't bet the Farm.
    • Create a problem statement
    Sell the Solution(s), not a Technology
    • Selling is an educational process
    • Understand that Hadoop is a tool, not a panacea or 'cure-all'
Hadoop
 Large data storage
 Bringing execution to the data
 Structured and unstructured data
 Massively parallel processing
 Extensible ecosystem

Not Hadoop
 Real-time data processing (difficult)
 Data set is not large enough
 Processing algorithm not compatible with M/R
 Existing processes are well suited to solve the problem
 ACID requirements (transaction-based)
 One person doing a million things vs. one million people doing one thing
 Set realistic goals
 Set boundaries
 Avoid scope creep
 Embrace what you don't know:
   • Make an honest evaluation of your and your team's skills
   • Hadoop is a paradigm shift; therefore you need to alter your
     approach to solving the problem.
 Level Set Expectations:
   • The technology is new to the organization
   • There is a learning curve
   • TANSTAAFL (There ain't no such thing as a free lunch.)
   • Think for yourself: take Hadoop urban legends with a grain of
     salt
   The sales process takes time.
   Selling is an educational process
     For you:
      • Learn the stakeholders' pain
      • Determine the scope of the problem
      • Formulate your own estimates
     For your stakeholders:
      • They must 'buy in' to your solution.
      • Appreciate the underlying technology
      • Understand the risks

   Don't oversell, and don't underestimate
 Reaffirm the stated pains and any identified latent
  pain(s).
 Give your audience time to digest the presented
  information.
 Show how the solution solves their problem

 Avoid going straight to 'The Bottom Line'

 Understand common objections and overcome
  them.
     • “…We can do this in an RDBMS…”
     • “…This sounds risky…”
     • “…Who else is doing this?…”
     • “…Who's using it in production?…”
     • “…Sounds expensive…”

 Talking points are included at the end of the slide presentation
   Executive Sponsorship – Identify the key players and understand
    their 'pains'.

   Project is Sufficiently Funded

   Project Charter – The project is well defined with set goals and
    expectations.

   Level Set Expectations: The technology is new to your company,
    and it should be expected that you will face setbacks during the
    project. (Lower the expectations to a point where you know you
    can exceed them.)

   Outside Expertise. (Buy/Build/Blended Model)
   Resources have been identified and have been dedicated to this
    project.

     Business Analyst Support – a good understanding of data
      and access patterns is essential.

     Architecture – Hadoop is a paradigm shift. It is essential to reflect
      it in the solution architecture. Integration with existing enterprise
      applications can present additional challenges.

     Developers – Candidates (Java/Unix Proficiency with a myriad of
      data-driven projects under their belt). Ability and desire to learn
      new tricks.

     Infrastructure Support – have Hadoop administrators who are
      experienced and/or capable of learning.

   Training – Not just APIs, but also Hadoop concepts and patterns.
   Hadoop is an unregistered trademark of Apache
   Several companies provide commercial
    support for Hadoop and Hadoop derivatives.
      • Cloudera
      • MapRTech
      • HortonWorks
      • Others (HStreaming, DataStax, …)
   And there is also Amazon…
   Application - Walk through the business process and create a simple,
    plain-English outline of what you want to achieve in each step.

   Hardware - Determine your initial data set(s) and design your
    cluster accordingly.

   Design & Development are iterative processes.
     Your first iteration is rarely your last iteration.
     Don't be embarrassed by your code. Share it with others for feedback
      and improvement.

   KISS, KISS, KISS

   Data storage - Which to use: HDFS or HBase?
HDFS
 Use HDFS when you are always going to access your data as an
  entire set or a very large subset.
 HDFS access is sequential and read-only; HDFS supports only
  create and append.
 HDFS is mainly used in Map/Reduce. Direct access from the client
  is possible, but typically requires indexing. It provides language
  (Java) APIs only.
 When using HDFS you always want large (GB) size files. Packaging
  smaller files into larger ones requires development effort.

HBase
 Use HBase when you want random access to your data set: individual
  records, partial records, and subsets of records. HBase provides
  more control over partitioning data.
 HBase supports get, put, update, and scan of sequential keys.
 HBase can be accessed from either a Map/Reduce program or directly
  from a client. It supports Java, REST, and Thrift APIs.
 HBase provides built-in versioning and purging of data.
 Many new enhancements are coming; coprocessors are the most
  significant.
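To make the contrast concrete, below is a minimal access sketch, assuming the standard Hadoop FileSystem API and a 2011-era (0.90.x) HBase client; the path, table, and column names are illustrative assumptions, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StorageAccessSketch {
  public static void main(String[] args) throws Exception {
    // HDFS: sequential, create/append-only access to large files.
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/data/events/part-00000"));
    out.writeBytes("record-1\n"); // write-once; no in-place updates
    out.close();

    // HBase: random reads/writes of individual records by row key.
    HTable table = new HTable(HBaseConfiguration.create(), "events");
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("v1"));
    table.put(put);                                           // random write
    Result row = table.get(new Get(Bytes.toBytes("row-42"))); // random read
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"))));
    table.close();
  }
}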
    Automate your Environment Setup
     Use Puppet, Chef, Cloudera Enterprise Manager, etc…
     Rely on Hadoop Ecosystem whenever possible.

    Configuration
    • See Mike Guenther's Lecture (CHUG Archive)
    • Use Cloudera Docs
    • Configuration is a continuous process
    • Tune the cluster and the application independently.
    • Don't optimize your cluster for your application; optimize your
       application for your cluster.
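As a small illustration of that last point, here is a sketch of application-level tuning in a 0.20-era driver; the property names are real for that vintage of Hadoop, but the values are illustrative assumptions, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Cluster-wide defaults stay in core-site.xml / mapred-site.xml;
// the application overrides only what this particular job needs.
public class JobTuning {
  public static Job newTunedJob() throws Exception {
    Configuration conf = new Configuration(); // loads the cluster's *-site.xml
    conf.setBoolean("mapred.compress.map.output", true); // shrink shuffle I/O
    conf.setInt("io.sort.mb", 256);            // map-side sort buffer (MB)
    Job job = new Job(conf, "tuned-job");
    job.setNumReduceTasks(16); // sized to the cluster, not the other way around
    return job;
  }
}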

   Plan your Development Iterations
    • Data storage Model
    • ETL (loading data in/out of Hadoop)
    • Automate Environment Setup
    • Processing
    • Integration (interacting with other enterprise applications)
    • Reporting interface & diagnostics to show speed and utilization
   Understand the MapReduce model and patterns – read Jimmy Lin
    and Chris Dyer's book Data-Intensive Text Processing with
    MapReduce.
   See if you really need reducers (they are expensive) and, if you
    do, try to use combiners (see the word-count sketch after this list)
   Use a custom InputFormat if you need better control over the
    execution of your Maps
   Programmatic writes to predetermined files might lead to
    unpredictable results.
   Use Oozie for orchestrating multiple MapReduce jobs.
   Use Oozie for automatically starting your jobs when data arrives
    (a client-submission sketch also follows this list)
   Don't be afraid to ask for help.
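To ground the reducer/combiner advice, here is a minimal word-count sketch in the 0.20-era API; reusing the reducer as a combiner is safe here only because summing is associative and commutative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // business logic only; the framework
      }                           // handles partitioning, sort, and shuffle
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregates map output
                                            // locally to shrink the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}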
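And a sketch of the Oozie point, assuming the Oozie Java client and a workflow already deployed to HDFS; the URL and paths are illustrative assumptions. (For data-triggered starts, the workflow would be wrapped in an Oozie coordinator with a dataset dependency.)

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

// Submitting a predeployed Oozie workflow from Java.
public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH,
        "hdfs://namenode:8020/apps/wordcount-wf"); // dir holding workflow.xml
    props.setProperty("nameNode", "hdfs://namenode:8020");
    props.setProperty("jobTracker", "jobtracker:8021");
    String jobId = oozie.run(props); // submit and start the workflow
    System.out.println("Started workflow: " + jobId);
  }
}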
   Be prepared to refactor your code many times. You often start
    wrong, but your goal is to end right.

   Tom White's Hadoop book (Hadoop: The Definitive Guide)

   Lars George's HBase book (HBase: The Definitive Guide)

   In addition to MapReduce, investigate additional Hadoop
    technologies (Pig, Hive, Flume, et al)

   Be prescriptive; use only the technology you really need

   Don't forget about the community; it will be extremely
    helpful. See http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/
    [Shameless plug. :) ]
   Unit Test the Application and the Interface (see the MRUnit
    sketch after this list)
   Test Hadoop – report issues to Cloudera.
   Opening Support Tickets* – a life saver for new teams. (Cloudera
    offers support contracts)
   Optimize your application, not the cluster
   End to End Testing – it matters, it ensures confidence
   Performance testing – it's one of the drivers of the project.
     Make sure you test on realistic data volumes – results can be
      deceiving on smaller data sets.
     Showcase the ability of the cluster compared to existing
      systems
   Consulting – have consultants look over your application, but do not
    outsource the implementation to them. Make sure you build internal
    knowledge.

                                       *Assumes that you have a corporate license…
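A sketch of the unit-testing point, assuming Apache MRUnit and the word-count mapper sketched earlier; MRUnit exercises a mapper or reducer in isolation, with no cluster required:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void mapperEmitsOneCountPerToken() throws Exception {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordCount.WordCountMapper()) // from the earlier sketch
        .withInput(new LongWritable(0), new Text("hadoop hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .runTest(); // verifies outputs without spinning up a cluster
  }
}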
 SLAs   – Not advisable for Hadoop Project #1
 Involve   Deployment & Operations personnel from the get-go;
    they will be supporting it
   Operations Team:
    • Hadoop Administration Training
    • Data Analysts & Users trained and involved with the process
     as stakeholders
    Data Maintenance – The role of the DBA begins to
    change; existing DBAs should take an interest in Hadoop
   Playbooks – should help address many Hadoop-related issues
    without involving developers & architects
   UATs – use as needed and depending on methodology
   What worked well in the first project?

   What did not work?

   Ready to process Mission Critical Data?

   Begin to establish SLAs?

   Consider real-time data delivery?

   Ready to support enterprise data?
http://hadoop.apache.org/ (Apache Hadoop)
http://www.cloudera.com/ (Cloudera)
http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/ (CHUG)


Or find Mike, Boris, or Mark on LinkedIn
Appendix
•   Scalability – A large data problem can be broken into many pieces
    processed in parallel by 10, 100, or 1,000 machines, all
    working toward a common goal. Adding more machines improves
    scalability.

•   Incredible Performance – Hadoop holds the performance record
    for data processing (terabyte sort in 209 seconds – Yahoo!)

•   Data integrity – Data is stored multiple times across nodes.

•   Separation of concerns – developers need to write only business
    code – mappers and reducers. All infrastructure “heavy lifting” &
    job management is done by the framework.
•   Yahoo – Content Optimization, Sorting, Ad Placement

•   Facebook – Largest Hadoop Cluster, Terabytes of insights
    processed per DAY. Social email.

•   LinkedIn – Computationally Intensive operations for Enterprise
    Data: “People You May Know”, “Viewers of this Profile Also
    Viewed”, “Job Recommendations”

•   Groupon – Analytics and Data mining on “Extreme Data”

•   Nokia – See http://www.cloudera.com/videos/apache-hadoop-nokia-josh-devins

•   For more companies see:
    http://wiki.apache.org/hadoop/PoweredBy
•   Massive data storage – ability to correlate seemingly disparate
    data. Ability to store lots of historical data.

•   Computational Power – Ability to run reports and ask questions
    that could not previously be asked – asking “golden questions”

•   Throughput – shorter time to complete jobs allows even more
    “golden questions”

•   “Golden questions” – change the game, drive profits, and
    positively disrupt businesses
•   Commodity Resources - Nodes cost as much as a workstation.
    No specialized hardware.

   Expenditures - No software purchases, no negotiations with
    vendors, no licensing headaches – free downloads. (For initial
    PoC installation.)

•   Easily proved - A Proof of Concept can be executed in a
    virtualized environment or in a public cloud.

Editor's notes

  1. This is our obligatory slide that tells you who we are and that some of us are really old and have been doing this for far too long. Everyone does their own, so that the audience knows who we are. Maybe have everyone introduce themselves, but I really don't want to pimp myself. [Mikey]
  2. [Mikey] I want to preface this slide by stating that the ideal audience for this presentation is someone who's just starting to investigate Hadoop and wants to introduce it to their organization. If you've already started implementing a project, please pay attention to Part 2, where we discuss ways to increase your project's chance of success. Also, any feedback on your Hadoop selling experience will be valuable for the authors.
  3. [Mikey] Step one: Setting your goals. The first thing one needs to do is to identify what problems you want to solve. Create a 'short list' of the problems, and determine which problem is the best candidate. Look for a problem that can be solved in an M/R environment. Look for a problem where you're not 'betting the farm', one where, if you fail to deliver a solution on time and on budget, you're not going to condemn Hadoop as an option for future projects. Create a problem statement which in plain English identifies the problem you are attempting to solve and some 'boxing' constraints which limit the scope of the problem. Once you have identified the problem you want to solve, you need to sell the problem and solution to your stakeholders. In selling the solution you want to focus on the solution itself and not the underlying technology. In this case, we are talking about Hadoop. Sure, Hadoop is sexy and everyone wants to learn it… to pad their resume. But your stakeholders don't care about the technology, just that you have a potential solution which solves their problem and is cost effective. While we are here because we like Hadoop, and use Hadoop, remember that Hadoop is just a tool; it's not a 'cure all' and perfect for every problem. If you're at the stage that you know you want to use Hadoop, but you don't know what sort of problems you need to solve, it's time to identify the potential stakeholders, those who
  4. [Mikey] This leads us to our next point. When do we want to use Hadoop? What sort of problems do we think will be a good fit for Hadoop, and what problems do we think would be better solved using a different tool? These are all questions that we have to think about before settling on a tool. [Boris will walk through slide]
  5. [Mikey] Part of the selling process is to first realize what you want to sell to 'management'. You first have to set your goals and know what you want to gain from the project. (Besides learning how to work with a really cool tool and pad your resume…) If you do not yet have a problem to solve, you may want to do some research and talk with your stakeholders. So… we set realistic goals… like processing X records per hour or Y incoming files... some metric that you know you can beat and should really be obtainable. Once you have the project and the goals, you need to set boundaries. Like processing a specific stream of data only. Or only handling CSV files and not XML input files. Once you've set your boundaries, if at all possible, you want to avoid scope creep, noting that you can always add to the project after you get it working. Lock down the requirements at the start of the project. [Talk through points on slide…]
  6. [Mikey] There is a psychology to the selling process, at a high level. Even if you've done your homework and know the answer before presenting the solution, if you provide the answer too quickly, your stakeholders will suspect you and your solution. You have to listen to your stakeholders, learn their 'pain', and determine the scope of the problem and what the constraints to the problem are. (Proposing a million-dollar solution when there's only $100,000 in the budget doesn't help.) By listening to the stakeholders, you are showing them that you are crafting your solution to meet their needs, and when you present your solution you can address and re-affirm their pain. Once you have a rough idea, get estimates rather than relying on a SWAG. It's OK to say you don't know something; make an action item and take the time to get the right answer. The stakeholders need to buy in to your solution and take ownership of it. They need to appreciate the underlying technology, its challenges and risks.
  7. [Mikey] When presenting the solution to the stakeholders, you need to have your ducks in a row. You need to re-affirm their stated pains, along with any latent pains you find while talking with the stakeholder's team(s). This not only shows that you are presenting a solution but that your solution addresses their needs, and it starts the process of taking ownership of the solution. During the presentation you want to avoid 'cutting to the chase' or going straight to the bottom line of saying that it costs X dollars. By going straight to the bottom line, you don't give the stakeholders and project leads time to digest the solution and to take ownership of the problem. Stakeholders typically don't care about the underlying technology. They are more interested in finding a cost-effective solution that solves their problem and can be modified as the business or environment changes. This is not to say that explaining the technology in the solution isn't important, but your 'sales' success is going to be based on how well you meet their criteria for success. Does the solution meet their needs? What's the time to value? Relative to other possible solutions, is this cost effective? There are some objections that you can't overcome. In these situations, Hadoop isn't a good fit and you should move on to the next potential problem to solve.
  8. Mark
  9. Boris + Mark
  10. Hadoop is an unregistered trademark of Apache and is meant to refer to Apache's release only. Any release which is not the official release from Apache would be a derivative work. Cloudera offers a derivative work which is free and also has commercial support. MapRTech has a derivative work that replaces the underlying HDFS with their own proprietary use of C++, writing directly to the raw disk. Mikey, Boris – Amazon
  11. The KISS principle has been around for ages. Regardless of your design methodology, you want to start off simple and build out. This allows one to learn the technology and work through the design challenges. When working through the software design, start by creating a simple English description of how you want to process the data and what you want to achieve in each step. This is useful when you need to go back to the SME/Business Analyst, who may not be familiar with UML or a class diagram. (It's also a document that you can use to verify the other diagrams.) Boris, Mike – hardware
  12. Boris
  13. [mike] I am not sure what you want to say with this slide, please add speaker notes! [mark] One of your first tasks is setting your environment up. Whether you go virtual or physical for your first project, you will need to refer to documentation; do not stray too far from the default configuration until you are comfortable or advised to do so. Use a tool like Puppet or Chef for configuration management. Additional configuration tips can be found in Mike Guenther's presentation and at docs.cloudera.com. As you develop features, you will need to address your data model. You will also be writing code to ingest data, process it, and display it. Keep in mind that these features will be part of your report on how you succeeded with Hadoop. This is a multi-level slide: 1. You need a reproducible environment; you can't afford to rely on manual tweaking every time you have to reinstall. 2. Without proper configuration your application will not work. Configuration is a two-level process: optimize your cluster to run any job well, then optimize your job for the cluster that you are using. Give an example of separating HBase configuration from your table configuration. 3. Describe design steps. Mark?
  14. Boris.
  15. Boris, Mike community
  16. Mark
  17. Mark
  18. Mark?
  19. All