SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
myHadoop - Hadoop-on-Demand
on Traditional HPC Resources
Sriram Krishnan, Ph.D.
sriram@sdsc.edu
Acknowledgements
• Mahidhar Tatineni
• Chaitanya Baru
• Jim Hayes
• Shava Smallen
Outline
• Motivations
• Technical Challenges
• Implementation Details
• Performance Evaluation
Motivations
• An open source tool for running Hadoop jobs on
HPC resources
• Easy to configure and use for the end-user
• Plays nicely with existing batch systems on HPC resources
• Why do we need such a tool?
• End-users: I already have Hadoop code – and I only have
access to regular HPC-style resources
• Computer Scientists: I want to study the implications of
using Hadoop on HPC resources
• And I don’t have root access to these resources
Some Ground Rules
• What this presentation is:
• A “how-to” for running Hadoop jobs on HPC resources
using myHadoop
• A description of the performance implications of using
myHadoop
• What this presentation is not:
• Propaganda for the use of Hadoop on HPC resources
Main Challenges
• Shared-nothing (Hadoop) versus HPC-style
architectures
• In terms of philosophies and implementation
• Control and co-existence of Hadoop and HPC
batch systems
• Typically, both Hadoop and HPC batch systems
(e.g., SGE, PBS) need complete control over the
resources for scheduling purposes
Traditional HPC Architecture
PARALLEL FILE SYSTEM COMPUTE CLUSTER WITH
MINIMAL LOCAL
STORAGE
Shared-nothing (MapReduce-style) Architectures
COMPUTE/DATA CLUSTER
WITH LOCAL STOARGE
ETHERNET
Hadoop and HPC Batch Systems
• Access to HPC resources is typically via batch systems
– e.g., PBS, SGE, Condor
• These systems have complete control over the compute resources
• Users typically can’t log in directly to the compute nodes (via ssh) to
start various daemons
• Hadoop manages its resources using its own set of
daemons
• NameNode & DataNode for the Hadoop Distributed File System (HDFS)
• JobTracker & TaskTracker for MapReduce jobs
• Hadoop daemons and batch systems can’t co-exist
seamlessly
• Will interfere with each other’s scheduling algorithms
myHadoop Requirements
1. Enabling execution of Hadoop jobs on shared HPC
resources via traditional batch systems
a) Working with a variety of batch systems (PBS, SGE, etc.)
2. Allowing users to run Hadoop jobs without needing
root-level access
3. Enabling multiple users to simultaneously execute
Hadoop jobs on the shared resource
4. Allowing users to either run a fresh Hadoop instance
each time (a), or store HDFS state for future runs (b) – see
the sketch below
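
As a rough illustration of requirement 4, the bootstrap step can simply point HDFS at different storage depending on the mode. A minimal shell sketch follows – this is not the actual myHadoop code, and the variable names and paths are hypothetical:

#!/bin/bash
# Hypothetical sketch: choosing where HDFS data lives (requirement 4)
if [ "$MY_HADOOP_PERSISTENT" = "true" ]; then
    # 4(b): keep HDFS state on the parallel file system across runs
    HDFS_DATA_DIR="/oasis/$USER/hdfs_data"
else
    # 4(a): fresh instance on node-local scratch, discarded at teardown
    HDFS_DATA_DIR="/scratch/$USER/$PBS_JOBID/hdfs_data"
fi
# The generated hdfs-site.xml then sets dfs.data.dir to $HDFS_DATA_DIR.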
myHadoop Architecture
[Figure: the batch processing system (PBS, SGE) allocates compute nodes [1]; Hadoop daemons run on the allocated nodes [2, 3]; in non-persistent mode HDFS lives on node-local storage [4(a)], while in persistent mode it lives on the parallel file system [4(b)]. Bracketed numbers refer to the requirements above.]
Implementation Details: PBS, SGE
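
The original slide shows the PBS and SGE scripts as a figure. As a hedged sketch of what a myHadoop-style bootstrap must do under PBS – the script layout and template names here are hypothetical, though $PBS_NODEFILE and the Hadoop 0.20 commands are standard – the configure step derives the node list from the batch allocation, writes per-job site configuration, formats HDFS, and starts the daemons:

#!/bin/bash
# Sketch of a myHadoop-style bootstrap under PBS (hypothetical names).
# A per-job config directory lets multiple users run concurrently.
CONFIG_DIR="$HOME/hadoop-config.$PBS_JOBID"
mkdir -p "$CONFIG_DIR"

# Derive the Hadoop node list from the batch allocation
sort -u "$PBS_NODEFILE" > "$CONFIG_DIR/slaves"
MASTER=$(head -1 "$CONFIG_DIR/slaves")
echo "$MASTER" > "$CONFIG_DIR/masters"

# Point the NameNode and JobTracker at the first allocated node
# (templates/ is an assumed directory of site-config templates)
sed "s/MASTER_HOST/$MASTER/" templates/core-site.xml > "$CONFIG_DIR/core-site.xml"
sed "s/MASTER_HOST/$MASTER/" templates/mapred-site.xml > "$CONFIG_DIR/mapred-site.xml"
cp templates/hdfs-site.xml "$CONFIG_DIR/"

# Format HDFS (skipped in persistent mode if prior state exists),
# then start the daemons as an ordinary user -- no root access needed
"$HADOOP_HOME/bin/hadoop" --config "$CONFIG_DIR" namenode -format
"$HADOOP_HOME/bin/start-dfs.sh" --config "$CONFIG_DIR"
"$HADOOP_HOME/bin/start-mapred.sh" --config "$CONFIG_DIR"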
User Workflow
[Figure: the user's batch job bootstraps a Hadoop cluster on the allocated nodes, runs the Hadoop job(s), and tears the cluster down at the end.]
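
A user's job script then sandwiches the actual Hadoop work between the bootstrap and teardown steps. A minimal sketch, reusing the hypothetical helper scripts from the previous sketch and assuming a pre-built wordcount.jar:

#!/bin/bash
#PBS -N myhadoop-job
#PBS -l nodes=4:ppn=1,walltime=01:00:00
cd "$PBS_O_WORKDIR"

./bootstrap-hadoop.sh    # configure and start Hadoop (previous sketch)

HADOOP="$HADOOP_HOME/bin/hadoop --config $HOME/hadoop-config.$PBS_JOBID"
$HADOOP dfs -copyFromLocal input/ input          # stage data into HDFS
$HADOOP jar wordcount.jar org.myorg.WordCount input output
$HADOOP dfs -copyToLocal output/ results/        # pull results back out

./teardown-hadoop.sh     # stop daemons; wipe scratch in non-persistent mode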
Performance Evaluation
• Goals and non-goals
• Study the performance overhead and implications of myHadoop
• Not to optimize/improve existing Hadoop code
• Software and Hardware
• Triton Compute Cluster (http://tritonresource.sdsc.edu/)
• Triton Data Oasis (Lustre-based parallel file system) for data storage, and for
HDFS in “persistent mode”
• Apache Hadoop version 0.20.2
• Various parameters tuned for performance on Triton
• Applications
• Compute-intensive: HadoopBlast (Indiana University)
• Modest-sized inputs – 128 query sequences (70K each)
• Compared against NR database – 200MB in size
• Data-intensive: Data Selections (OpenTopography Facility at SDSC)
• Input size from 1GB to 100GB
• Sub-selecting around 10% of the entire dataset
HadoopBlast
Data Selections
Related Work
• A recipe for running Hadoop over PBS from the blogosphere
• http://jaliyacgl.blogspot.com/2008/08/hadoop-as-batch-job-using-pbs.html
• myHadoop is “inspired” by their approach – but is more general-purpose and configurable
• Apache Hadoop On Demand (HOD)
• http://hadoop.apache.org/common/docs/r0.17.0/hod.html
• Supports only PBS, needs an external HDFS, is harder to use, and has
trouble with multiple concurrent Hadoop instances
• CloudBatch – batch queuing system on clouds
• Use of Hadoop to run batch systems like PBS
• Exact opposite of our goals – but similar approach
Center for Large-Scale Data Systems Research (CLDS)
[Figure: CLDS organization chart. An industry-university consortium on software for large-scale data systems, guided by an Industry Advisory Board and an Academic Advisory Board. Activity areas: benchmarking, performance evaluation, and systems development projects; industry forums and professional education; the “How Much Information?” project (public, private, personal); visiting fellows; information metrology (data growth, information management); cloud storage architecture; cloud storage and performance benchmarking; and industry interchange (management and technical forums).]
• Student internships
• Joint collaborations
Summary
• myHadoop – an open source tool for running Hadoop
jobs on HPC resources
• Without need for root-level access
• Co-exists with traditional batch systems
• Allows “persistent” and “non-persistent” modes to save HDFS state
across runs
• Tested on SDSC Triton, TeraGrid and UC Grid resources
• More information
• Software: https://sourceforge.net/projects/myhadoop/
• SDSC Tech Report: http://www.sdsc.edu/pub/techreports/SDSC-TR-2011-2-Hadoop.pdf
Questions?
• Email me at sriram@sdsc.edu
Appendix
core-site.xml:
io.file.buffer.size 131072 Size of read/write buffer
fs.inmemory.size.mb 650 Size of in-memory FS for merging outputs
io.sort.mb 650 Memory limit for sorting data

hdfs-site.xml:
dfs.replication 2 Number of times data is replicated
dfs.block.size 134217728 HDFS block size in bytes
dfs.datanode.handler.count 64 Number of handlers to serve block requests

mapred-site.xml:
mapred.reduce.parallel.copies 4 Number of parallel copies run by reducers
mapred.tasktracker.map.tasks.maximum 4 Max map tasks to run simultaneously
mapred.tasktracker.reduce.tasks.maximum 2 Max reduce tasks to run simultaneously
mapred.job.reuse.jvm.num.tasks 1 Reuse the JVM between tasks
mapred.child.java.opts -Xmx1024m Large heap size for child JVMs
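
Each row above corresponds to one <property> entry in the named file. As an illustration of the format (standard Hadoop configuration XML; $CONFIG_DIR is the hypothetical per-job directory from the earlier sketches), the block-size setting could be emitted like this:

# Sketch: emit a minimal hdfs-site.xml containing one tuned parameter
cat > "$CONFIG_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB HDFS blocks -->
  </property>
</configuration>
EOF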
Data Select Counts on Dash
Speaker Notes

  1. Mahidhar, Chaitan: architecture, prototyping. Jim: myHadoop roll. Shava: UC Grid support.
  2. Motivations – why myHadoop. Technical challenges – what the problems are. Implementation details – how. Performance evaluation – findings.
  3. Note: we didn’t make up these requirements. They came out of our own requirements as end-users and computer scientists. Most of us have access to resources such as the TeraGrid, UC Grid, and SDSC Triton. I have no official affiliation with any of those resources – but I had access to them, and wanted to use them for performance studies.
  4. Co-location of data and compute in shared-nothing architectures – no centralized shared storage in the Hadoop model. High-performance parallel file systems for HPC resources.
  5. 1) Scheduler access. 2, 3) Non-root, concurrent users. 4) Persistent and non-persistent modes.
  6. Data loads are not very dominant – the workload is more CPU-intensive. Performance is slightly better on local disk – there is more contention on Oasis, and Lustre is optimized for large files, not lots of smaller ones.
  7. Data loads are the dominating factor for the non-persistent runs. It makes more sense to leave the data on the shared file system – and use that as the HDFS location. Output writes are also time-consuming in this case – so one might as well leave the data in HDFS for future runs.