Toward Mega-Scale Computing with pMatlab. Chansup Byun and Jeremy Kepner, MIT Lincoln Laboratory; Vipin Sachdeva and Kirk E. Jordan, IBM T.J. Watson Research Center. HPEC 2010. This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
Parallel Matlab (pMatlab). Layered architecture: Application (Input, Analysis, Output); Library Layer (pMatlab): Vector/Matrix, Comp, Task, Conduit (User Interface, Parallel Library); Kernel Layer: Math (MATLAB/Octave) and Messaging (MatlabMPI) (Hardware Interface); Parallel Hardware.
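To make the library layering concrete, here is a minimal pMatlab-style sketch of a distributed computation. It follows the usual pMatlab pattern (map, distributed zeros, local, put_local, agg), but exact call signatures can vary between pMatlab releases, so treat it as an illustrative sketch rather than the authoritative API; Np and Pid are assumed to be provided by the pMatlab runtime.

% Minimal pMatlab-style sketch (illustrative). Np comes from the pMatlab
% runtime; the map/local/agg calls follow the usual pMatlab usage pattern.
Xmap = map([Np 1], {}, 0:Np-1);      % distribute rows over Np processes
X    = zeros(1024, 1024, Xmap);      % distributed 1024x1024 array
Xloc = local(X);                     % this process's block of rows
Xloc = Xloc.^2 + 1;                  % purely local math in the MATLAB/Octave kernel
X    = put_local(X, Xloc);           % write the block back into the distributed array
Y    = agg(X);                       % gather the full array on the leader process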
IBM Blue Gene/P System. Core speed: 850 MHz. LLGrid core count: ~1K; Blue Gene/P core count: ~300K.
Blue Gene Application Paths. The Blue Gene environment offers two paths: High Throughput Computing (HTC) for serial and pleasantly parallel apps, and High Performance Computing (MPI) for highly scalable message passing apps.
HTC Node Modes on BG/P
Porting pMatlab to BG/P System
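The editor's note for this slide describes wrapping the two-step BG/P launch (boot an HTC partition with qsub, then submit to it) under pMatlab's pRUN command. The sketch below shows roughly how such a launch looks from the MATLAB/Octave prompt; the configuration name 'bgp-htc' is a hypothetical placeholder, since the actual third argument to pRUN depends on the site-specific gridMatlab/pMatlab setup.

% Hedged launch sketch: pRUN generates and submits the underlying scheduler
% commands (on BG/P, booting an HTC partition via qsub and then submitting
% the job to it). 'bgp-htc' is a placeholder configuration name.
Ncpus = 64;                              % number of pMatlab processes to start
eval(pRUN('pSpeed', Ncpus, 'bgp-htc'));  % run the pSpeed example on Ncpus processes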
Outline
Performance Studies
Single Process Performance: Intel Xeon vs. IBM PowerPC 450 (chart; lower is better). Benchmarks: HPL, MandelBrot, ZoomImage*, Beamformer, Blurimage*, FFT; reported times include 26.4 s, 11.9 s, 18.8 s, 6.1 s, 1.5 s, and 5.2 s. * The conv2() performance issue in Octave has been improved in a subsequent release.
Octave Performance with Optimized BLAS. DGEMM performance comparison (chart; x-axis: matrix size N x N, y-axis: MFLOPS).
Single Process Performance: STREAM Benchmark (chart; higher is better; relative performance to Matlab). Measured rates include 944 MB/s, 1208 MB/s, and 996 MB/s. Kernels: Scale (b = q*c), Add (c = a + b), Triad (a = b + q*c).
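For reference, the three STREAM kernels named above are simple elementwise vector operations. A minimal serial MATLAB/Octave sketch is below; the array length N and the scalar q are arbitrary choices, not the values used in the measurements.

% Serial STREAM kernels in MATLAB/Octave (sketch; N and q are arbitrary).
N = 2^25;  q = 3.0;
a = zeros(N,1);  b = rand(N,1);  c = rand(N,1);

tic;  b = q .* c;      t_scale = toc;   % Scale: b = q*c
tic;  c = a + b;       t_add   = toc;   % Add:   c = a + b
tic;  a = b + q .* c;  t_triad = toc;   % Triad: a = b + q*c

% Rough bandwidth estimate: Triad touches three N-element arrays of 8-byte doubles.
bw_triad = 3 * N * 8 / t_triad / 1e6;   % MB/s
fprintf('Triad: %.1f MB/s\n', bw_triad);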
Point-to-Point Communication. (Diagram: processes Pid = 0, 1, 2, 3, ..., Np-1 exchanging messages.)
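pSpeed, the pMatlab example used for these measurements, times message exchanges between pairs of processes. The sketch below only illustrates that kind of timing loop using MatlabMPI-style calls; the message size, tags, and even/odd pairing are illustrative assumptions, not the actual pSpeed implementation.

% Sketch of a pSpeed-like point-to-point timing loop with MatlabMPI-style
% calls (each send/receive is a file write/read). Message size, tags, and
% the pairing scheme are illustrative, not the real pSpeed code.
MPI_Init;
comm = MPI_COMM_WORLD;
Np   = MPI_Comm_size(comm);
Pid  = MPI_Comm_rank(comm);
partner = bitxor(Pid, 1);               % pair neighbors: 0<->1, 2<->3, ... (Np even)
tag = 1000;
msg = rand(2^20, 1);                    % ~8 MB of doubles
tic;
if mod(Pid, 2) == 0
  MPI_Send(partner, tag, comm, msg);    % even ranks send first ...
  echo_msg = MPI_Recv(partner, tag+1, comm);
else
  echo_msg = MPI_Recv(partner, tag, comm);
  MPI_Send(partner, tag+1, comm, echo_msg);   % ... odd ranks echo the message back
end
t_roundtrip = toc;                      % round-trip time for one message exchange
MPI_Finalize;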
Filesystem Consideration. (Diagrams: processes Pid = 0, 1, 2, 3, ..., Np-1 exchanging message files through a single NFS-shared file system versus through crossmounted file systems.)
pSpeed Performance on LLGrid: Mode S (chart; higher is better).
pSpeed Performance on LLGrid: Mode M (charts; lower is better / higher is better).
pSpeed Performance on BG/P. BG/P filesystem: GPFS.
pStream Results with Scaled Size: 563 GB/sec, 100% efficiency at Np = 1024.
pStream Results with Fixed Size: VN mode: 8.832 TB/sec, 101% efficiency at Np=16384; DUAL mode: 4.333 TB/sec, 96% efficiency at Np=8192; SMP mode: 2.208 TB/sec, 98% efficiency at Np=4096.
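The scaled-size runs keep the per-process work fixed: per the editor's notes, the array holds 2^25 elements at Np=1 and the global size grows proportionally with Np. A hedged sketch of that weak-scaling setup inside a pMatlab program is below; Np is assumed to come from the pMatlab runtime, and the variable names are illustrative.

% Weak-scaling setup sketch for a pStream-style run: the local array size is
% fixed, so the global problem grows linearly with Np (Np supplied by the
% pMatlab runtime; names here are illustrative).
N_local  = 2^25;                  % elements per process (largest single-process size)
N_global = N_local * Np;          % global problem size grows with Np
q = 3.0;
b = rand(N_local,1);  c = rand(N_local,1);

tic;  a = b + q .* c;  t = toc;               % local Triad kernel
bw_local = 3 * N_local * 8 / t / 1e9;         % GB/s on this process
% The leader can then sum the per-process rates to report aggregate bandwidth.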
Outline
Current Aggregation Architecture. (Diagram: Np = 8, where Np is the total number of processes; the leader gathers sequentially from every other process.)
Binary-Tree Based Aggregation. The maximum number of messages a process may send or receive is N, where Np = 2^N. (Diagram: tree levels 0 1 2 3 4 5 6 7 → 0 2 4 6 → 0 4.)
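A hedged sketch of the message pattern behind such a binary-tree aggregation, assuming Np is a power of two: at level k, every process whose Pid is an odd multiple of 2^k sends its partially aggregated block to the process 2^k below it, so no process handles more than log2(Np) messages. MPI_Send/MPI_Recv are MatlabMPI-style placeholders, and combine() and local_part are stand-ins; the real BAGG() internals may differ.

% Binary-tree aggregation pattern (sketch; assumes Np = 2^N). After log2(Np)
% levels, Pid 0 holds the combined data. MPI_Send/MPI_Recv are MatlabMPI-style
% placeholders; combine() stands in for placing a received block into the array.
data = local_part;                      % this process's piece of the array
tag  = 2000;
for k = 0:log2(Np)-1
  step = 2^k;
  if mod(Pid, 2*step) == 0              % receiver at this level
    recv_part = MPI_Recv(Pid + step, tag + k, comm);
    data = combine(data, recv_part);
  else                                  % sender at this level (Pid is an odd multiple of 2^k)
    MPI_Send(Pid - step, tag + k, comm, data);
    break;                              % done after handing data down the tree
  end
end
% After the loop, Pid 0 holds the fully aggregated array.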
BAGG() Performance. IBRIX: 10x faster at Np=128; LUSTRE: 8x faster at Np=128.
BAGG() Performance, 2. 2.5x faster at Np=1024.
Generalizing Binary-Tree Based Aggregation. (Diagram: tree levels 0 1 2 3 4 5 6 7 → 0 2 4 6 → 0 4.)
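One hedged way to read the generalization described in the editor's note: pad Np up to the next power of two with fictitious processes and skip any exchange whose partner does not actually exist, leaving the tree loop above otherwise unchanged. The few lines below sketch only that delta; is_real is an illustrative helper name.

% Extended-tree delta for arbitrary Np (HAGG-style sketch): run the same tree
% loop over the next power of two and skip fictitious partners.
Np_ext  = 2^ceil(log2(Np));             % e.g. Np = 6 pads to Np_ext = 8
levels  = log2(Np_ext);                 % tree depth is log2(Np_ext)
is_real = @(p) (p >= 0) && (p < Np);    % guard checked before every send/receive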
BAGG() vs. HAGG(): both run ~3x faster than AGG() at Np=128.
BAGG() vs. HAGG(), 2
BAGG() Performance with Crossmounts (chart; lower is better).
Summary


Editor's Notes

  1. Title of the presentation, presented at HPEC 2010, September 15-16, 2010.
  2. Outline of the presentation. The introduction covers the motivation for porting pMatlab to mega-core systems such as the IBM Blue Gene/P system and how we did it. Then performance details of the math kernel (Octave) for pMatlab on BG/P are given. Finally, we present the technical details of, and the performance improvements made to, aggregation so that it works for large-scale computation.
  3. pMatlab is the library layer; it abstracts away the parallel distribution details so that the user calls pMatlab functions, which provide mechanisms for easily distributing and parallelizing Matlab code. pMatlab can use either MATLAB or Octave as the mathematics kernel, just as the messaging can be accomplished with any library that can send and receive MATLAB messages. For our current implementation, MatlabMPI is the communication kernel. pMatlab calls MatlabMPI functions to communicate data between processors.
  4. IBM Blue Gene/P system design diagram. Each compute node has four cores running at 850 MHz and 2 GB or 4 GB of memory. The Surveyor system at Argonne National Lab, which was used for this study, has 2 GB of memory per node. One full rack has 1024 CPUs, or 4096 compute cores.
  5. The IBM Blue Gene/P system supports two major computing environments. The High Throughput Computing (HTC) environment is ideal for serial and pleasantly parallel applications. The High Performance Computing environment targets highly scalable message-passing applications using the MPI library. The HTC environment is a good fit for porting pMatlab to BG/P.
  6. In Symmetric Multiprocessing (SMP) mode, each physical compute node executes a single process with a default maximum of four threads, and the process can access the full physical memory. With the DUAL and VN HTC modes, only ½ and ¼ of the full memory, respectively, is accessible by each process. For example, the Surveyor system has 2 GB of memory per compute node, so an SMP-mode process has approximately 2 GB minus the space occupied by the OS, a DUAL-mode process has about 1 GB of RAM, and a VN-mode process has about 512 MB of RAM.
  7. Porting pMatlab to the BG/P system is mainly a matter of integrating the BG/P job launch mechanism into the pMatlab job launch mechanism. Running pMatlab jobs on BG/P involves two steps. First, request and boot a BG partition in HTC mode using the qsub command on Surveyor. Then, once the partition is booted, execute the submit command to submit a job to the partition. By combining the two steps under the pRUN command, pMatlab jobs can be executed on BG/P.
  8. Outline slide for the performance studies. We will talk about single-process performance, point-to-point communication performance, and scalability results using the parallel STREAM benchmark.
  9. Examples used for the performance studies, which are available from the pMatlab package.
  10. Performance of example codes running on a single processor core is compared between the Intel Xeon and IBM PowerPC chips.
  11. Using an optimized BLAS library (ATLAS), the Octave 3.2.4 binary was able to run faster than Matlab 2009a on the DGEMM and HPL benchmark examples.
  12. STREAM benchmark performance is compared between the LLGrid and Surveyor (IBM BG/P) systems.
  13. Point-to-point communication performance on BG/P is measured by using pSpeed, a pMatlab code example.
  14. Because messages are files in pMatlab, an efficient file system is important in order to scale to a large number of processes. With a single NFS-shared file system, there will be serious resource contention as the number of processes increases. However, by using a group of cross-mounted distributed file systems, we can mitigate this issue. The next slide demonstrates the resource contention with the file system.
  15. The pSpeed performance with MATLAB was measured when all messages were written to a single, NFS-shared disk. Resource contention starts to appear with four processes, and with eight processes the issue is obvious.
  16. When using crossmounts for pMatlab message communication, the jittery behavior has been resolved. The overall performance of Octave matches well with that of Matlab when using four and eight processes.
  17. The point-to-point communication (pSpeed) on the BG/P system performed as expected when using IBM’s proprietary parallel file system, GPFS.
  18. Now we look at scalability from a different perspective. We increase the problem size proportionally as we increase the number of processes, so that each process has the same amount of work every time. In SMP mode, the initial global array size is set to 2^25 at Np=1, which is the largest size we can run on a single process. As the problem size scales proportionally with the number of processes, the bandwidth scales linearly. At Np=1024 (1024 nodes with one process per node), the Triad bandwidth is 563 GB/sec with 100% parallel efficiency.
  19. The pStream benchmark was able to scale up to 16,384 processes on an IBM internal BG/P machine.
  20. Outline slide for the aggregation optimization done for large-scale computation.
  21. The current aggregation architecture is easy to implement, but it is inherently sequential; therefore, the collection process slows down significantly as the number of processes grows.
  22. Binary-tree based aggregation can significantly reduce the number of messages that a process may send and receive, although each message becomes larger. For example, when Np = 8, the leader process receives three messages with BAGG(), while it receives seven messages with AGG().
  23. On the MIT Lincoln Laboratory cluster system, the performance of AGG() and BAGG() is compared for a 2-D data and process distribution using two different file systems. At large process counts, BAGG() performs significantly better than AGG(): with 128 processes, BAGG() achieved a 10x improvement on the IBRIX file system and an 8x improvement on the LUSTRE file system. The performance numbers are highly dependent on the file system load, since they were obtained on a production cluster.
  24. On the IBM Blue Gene/P system, the performance comparison was done over a wide range of process counts, from 8 to 1024 processes, using a 4-D data and process distribution. With the GPFS file system, BAGG() runs 2.5x faster than AGG().
  25. BAGG() only works when Np is a power of two. We generalized BAGG() so that it can work with any number of processes. The generalization is based on an extended binary tree, so that we can still use the binary-tree based aggregation architecture. There is some additional cost associated with the fictitious processes that pad the tree; for those fictitious processes, we skip the message communication.
  26. The cost associated with the additional bookkeeping appears to be minimal. On a production MIT Lincoln Laboratory cluster, the performance of AGG(), BAGG(), and HAGG() was compared. At Np = 128, BAGG() and HAGG() run about 3x faster than AGG().
  27. On an IBM Blue Gene/P system, the cost of the additional bookkeeping was studied over a wide range of process counts, using a 4-D data and process distribution on the GPFS file system. Both perform almost identically throughout the entire range, from 8 to 512 processes.
  28. On the LLGrid system, using a hybrid configuration (central file system for the lead process and crossmounts for the compute nodes), the new agg() function reduces resource contention on the file system and thus improves agg() performance significantly.
  29. Summary slide