Hadoop Successes and Failures to Drive Deployment Evolution
1. Hadoop Hands On
Successes and failures to drive evolution
Benoit PERROUD
Software Engineer @Verisign & Apache Committer
GITI BigData, EPFL, November 6, 2012
2. Disclaimer
• I apologize for speaking “Frenglish”
• The views and statements expressed in this talk do not necessarily reflect the
views of VeriSign, Inc. VeriSign, Inc. and any other person involved in the
company do not warrant the accuracy, reliability, currency or completeness of
those views or statements and do not accept any legal liability whatsoever
arising from any reliance on the views, statements and subject matter of the talk.
• Apache, Apache Hadoop, Hadoop, Cassandra, Apache Cassandra, Solr, Apache
Solr, HBase, Apache HBase, Tomcat, Apache Tomcat, ZooKeeper, Apache
ZooKeeper, Lucene, Apache Lucene and the yellow elephant logo are either
registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries.
• Java, GlassFish and the Java logo are registered trademarks of Oracle and/or its
affiliates
• Python and the Python logo are either registered trademarks or trademarks of the
Python Software Foundation
• MongoDB, Mongo and the leaf logo are registered trademarks of 10gen, Inc.
• All other marks are the property of their respective owners.
Verisign Public 2
5. Your first Hadoop Deployment
• Pseudo-distributed mode on a single node
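In pseudo-distributed mode every daemon runs on the same machine and the site files simply point at localhost. The sketch below renders those files, assuming a Hadoop 1.x layout. The property names are the ones used in the Hadoop 1.x single-node setup guide; the port numbers and the helper name render_site_xml are illustrative assumptions and do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Property names as in the Hadoop 1.x single-node setup guide.
# Ports 9000/9001 are illustrative assumptions, not values from the talk.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},  # one node, so one replica
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Render a dict of properties as the body of a Hadoop *-site.xml file."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in PSEUDO_DISTRIBUTED.items():
        print("---", filename)
        print(render_site_xml(props))
```

Setting the replication factor to 1 matters on a single node: with the default of 3, every block would be reported as under-replicated.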
6. Going Distributed
• TaskTracker (TT) and DataNode (DN) are moved to a
dedicated box
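A minimal sketch of what this move changes in the configuration, assuming a Hadoop 1.x cluster started with the stock scripts: the addresses stop pointing at localhost, and conf/slaves lists the boxes that run a DataNode and a TaskTracker. The host names, the ports and the function name distributed_layout are hypothetical.

```python
def distributed_layout(master_host, worker_hosts):
    """Return (site properties, conf/slaves content) for a small cluster.

    master_host runs the NameNode and JobTracker; every host in
    worker_hosts runs a DataNode and a TaskTracker.
    """
    if master_host in worker_hosts:
        raise ValueError("workers must be dedicated boxes, separate from the master")
    properties = {
        # Ports are illustrative assumptions, not values from the talk.
        "fs.default.name": "hdfs://%s:9000" % master_host,
        "mapred.job.tracker": "%s:9001" % master_host,
    }
    # conf/slaves holds one worker host name per line.
    slaves = "".join(host + "\n" for host in worker_hosts)
    return properties, slaves


if __name__ == "__main__":
    # Hypothetical host names.
    props, slaves = distributed_layout("master01", ["worker01", "worker02"])
    print(props)
    print(slaves, end="")
```

Adding capacity then amounts to appending a host to the worker list, which is what makes this layout the natural next step after the single-node setup.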
7. NameNode Single Point of Failure
• The NameNode can crash: configure a primary NameNode (PNN) and a secondary NameNode (SNN)
The NFS HA setup is not detailed here.
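One common way to protect NameNode metadata in Hadoop 1.x, shown here only as a hedged sketch since the slide leaves the NFS setup out: dfs.name.dir accepts a comma-separated list of directories, and the name table is written to all of them. The paths and the helper name name_dir_value are hypothetical.

```python
def name_dir_value(local_dir, nfs_dir):
    """Build a dfs.name.dir value that mirrors NameNode metadata to two places."""
    dirs = [local_dir, nfs_dir]
    if len(set(dirs)) != len(dirs):
        raise ValueError("metadata directories must be distinct")
    return ",".join(dirs)


# Hypothetical paths: one local disk and one NFS mount.
hdfs_site_overrides = {
    "dfs.name.dir": name_dir_value("/data/1/dfs/nn", "/mnt/nfs/dfs/nn"),
}

if __name__ == "__main__":
    print(hdfs_site_overrides["dfs.name.dir"])
```

This keeps a copy of the metadata off the NameNode box, so a lost machine does not mean a lost filesystem image. It does not by itself provide automatic failover.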
8. Bringing Data into the Cluster
• Data can come from inside the company, but also from
external sources.
Data retrieval and stream ingestion are oversimplified.
9. Dealing with API Changes
• Integration/Validation Cluster setup
The validation cluster is omitted from later slides for clarity.
20. Future Evolutions
• Hadoop Next Gen
  • YARN (2.0)
• Graph processing
  • Neo4j
  • Google Pregel / Apache Hama
• Incremental updates
• Real-time ad hoc queries
  • Cloudera Impala / Google Dremel
21. Conclusion
• Hadoop has gained huge momentum
• Technologies (around Hadoop) are evolving really fast
• There is no “One size fits all” solution
• Design is heavily driven by customer needs
• Data quality is a hidden requirement
22. Conclusion #2
• Data Scientists cost a lot
• Running on commodity hardware still costs a lot
• No one has the full understanding of the full data flow
• And you need several FTEs just to track the architecture
• You run a high risk of misusing these tools
• Hiring engineers with deep knowledge (meaning:
hands-on experience) of some of these tools is
already a challenge
23. Recommended Reading
Hadoop in Practice
by Alex Holmes
Senior Software Engineer @Verisign
24. Q&A
Benoit PERROUD
bperroud@verisign.com