A Buffering Approach to Manage I/O in a
Normalized Cross-Correlation Earthquake
Detection Code for Large Seismic Datasets
Dawei Mu, Pietro Cicotti, Yifeng Cui, Enjui Lee, Po Chen
Outline
1. Introduction of cuNCC code
2. Realistic Application
3. Performance Analysis
4. Memory Buffer Approach and I/O Analysis
5. Future Work
1. Introduction of cuNCC
what is cuNCC?
cuNCC is CUDA-based software designed to calculate the normalized
cross-correlation coefficient (NCC) between a collection of selected
template waveforms and the continuous waveform recordings of
seismic instruments, in order to evaluate the waveform similarity
and/or the relative travel-time differences.
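As a rough illustration of the quantity described above, the sketch below computes a zero-mean normalized cross-correlation coefficient for a template slid along a continuous record. It is plain Python, not the CUDA kernel; the function names, the toy data, and the choice of the zero-mean (Pearson) normalization are assumptions, since the slides do not give cuNCC's exact formula.

```python
import math

def ncc(template, window):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    n = len(template)
    mean_t = sum(template) / n
    mean_w = sum(window) / n
    num = sum((t - mean_t) * (w - mean_w) for t, w in zip(template, window))
    den = math.sqrt(sum((t - mean_t) ** 2 for t in template)
                    * sum((w - mean_w) ** 2 for w in window))
    # A flat window has zero variance; report no similarity in that case.
    return num / den if den > 0 else 0.0

def sliding_ncc(template, continuous):
    """NCC of the template against every full window of the continuous record."""
    m = len(template)
    return [ncc(template, continuous[i:i + m])
            for i in range(len(continuous) - m + 1)]

# Hypothetical toy data: a scaled copy of the template starts at sample 3.
template = [0.0, 1.0, -1.0, 0.5]
continuous = [0.1, -0.2, 0.05, 0.0, 2.0, -2.0, 1.0, 0.1, 0.0]
coeffs = sliding_ncc(template, continuous)
best = max(range(len(coeffs)), key=coeffs.__getitem__)
print("best match at sample", best)
```

Because the coefficient is normalized, the scaled copy of the template scores close to 1 regardless of its amplitude, which is what makes NCC usable for detecting small aftershocks with large-event templates.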
Feb 05, 2016 M6.6 Meinong aftershock detection
• more uncatalogued aftershocks were detected
• contributed to earthquake location detection and earthquake source parameter estimation
2. Realistic Application
M6.6 Meinong aftershock hypocenter re-location
• The traditional method uses the short-term/long-term average (STA/LTA) ratio to detect events and a 1-D model to locate hypocenters. Fewer aftershocks are detected, and the result contains less information due to inaccuracy.
• 3-D waveform template matching detects events and uses a 3-D model and waveform travel-time differences to re-locate the hypocenters. The result shows more events and more clustered hypocenters, which gives us detailed fault geometry.
• Over 4 trillion NCC calculations were involved.
3. Performance Analysis
optimization scheme
• cuNCC is bound by memory bandwidth.
• Constant memory is used to stack multiple events into a single computational kernel.
• Shared memory is used to improve memory bandwidth utilization.
[Profiler output shown on the slide; lines cut off at the right edge are marked with "[…]"]
1. Compute, Bandwidth, or Latency Bound
The first step in analyzing an individual kernel is to determine if the performance of the kernel is bounded by com[…]
bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "cuNCC_04[…]
limited by memory bandwidth. You should first examine the information in the "Memory Bandwidth" section to d[…]
limiting performance.
1.1. Kernel Performance Is Bound By Memory Bandwidth
For device "GeForce GTX 980" the kernel's compute utilization is significantly lower than its memory utilization[…]
levels indicate that the performance of the kernel is most likely being limited by the memory system. For this kern[…]
factor in the memory system is the bandwidth of the Shared memory.
3. Performance Analysis
performance benchmark
The cuNCC code can achieve high performance without high-end
hardware or expensive clusters. An optimized CPU-based NCC code
needs 21 hours on one E7-8867 CPU (all 18 cores) to finish the
example mentioned above, while an NVIDIA GTX980 takes only 53 minutes.
Hardware            Runtime (ms)  SP FLOP (×1e11)  Achieved GFLOPS  Max GFLOPS  Achieved Percentage  Speedup
E7-8867 (18 cores)  2968          1.23             41.36            237.6       17.4%                1.0x
C2075 (Fermi)       495           1.8              363.83           1030        35.3%                6.0x
GTX980 (Maxwell)    116           1.8              1552.80          5000        31.0%                25.6x
M40 (Maxwell)       115           1.8              1569.86          7000        22.4%                25.8x
P100 (Pascal)       62            1.8              2911.84          10600       27.5%                47.9x
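The derived columns of the benchmark follow from runtime, FLOP count, and peak rate. The sketch below recomputes them from the values on the slide; the variable and function names are illustrative. Small differences from the slide's figures are expected, presumably because the slide rounds the FLOP counts.

```python
# (name, runtime in ms, single-precision FLOP, peak GFLOPS), copied from the slide.
ROWS = [
    ("E7-8867 (18 cores)", 2968, 1.23e11, 237.6),
    ("C2075 (Fermi)", 495, 1.8e11, 1030),
    ("GTX980 (Maxwell)", 116, 1.8e11, 5000),
    ("M40 (Maxwell)", 115, 1.8e11, 7000),
    ("P100 (Pascal)", 62, 1.8e11, 10600),
]

def derive(rows):
    """Return (name, achieved GFLOPS, percent of peak, speedup vs. first row)."""
    base_ms = rows[0][1]
    out = []
    for name, ms, flop, peak in rows:
        gflops = flop / (ms / 1000.0) / 1e9
        out.append((name, gflops, 100.0 * gflops / peak, base_ms / ms))
    return out

derived = derive(ROWS)
for name, gflops, pct, speedup in derived:
    print(f"{name:20s} {gflops:8.1f} GFLOPS  {pct:5.1f}% of peak  {speedup:5.1f}x")
```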
4. Memory Buffer Approach and I/O Analysis
the I/O bottleneck
After improving the computational performance with GPU acceleration,
I/O efficiency became the new bottleneck of cuNCC’s overall performance.
The output file of cuNCC is a 1-D vector of similarity coefficients
saved in binary format; its size is equal to that of the seismic data file.
In the CPU NCC code, I/O operations cost roughly 10% of the total runtime,
while in the GPU code I/O costs more than 75% of the total runtime.
[Figure: bar chart comparing NCC (CPU) and cuNCC, with series "I/O" and "Compute"; axis range 0 to 500.]
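The shift of the bottleneck can be reproduced with a short Amdahl-style calculation: if only the compute part gets faster, the unchanged I/O time takes a larger share of the total. The sketch below combines the 10% CPU I/O share with the 25.6x GTX980 speedup from the benchmark table; pairing those two numbers is an assumption made for illustration, not a measurement from the slides.

```python
def io_fraction_after_speedup(io_fraction, compute_speedup):
    """Share of runtime spent in I/O once only the compute part is accelerated."""
    io = io_fraction
    compute = (1.0 - io_fraction) / compute_speedup
    return io / (io + compute)

# Assumed inputs: 10% I/O share on the CPU, 25.6x compute speedup on the GPU.
share = io_fraction_after_speedup(0.10, 25.6)
print(f"I/O share after acceleration: {share:.0%}")
```

With these inputs the I/O share comes out at roughly three quarters of the runtime, the same order as the share reported for the GPU code.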
4. Memory Buffer Approach and I/O Analysis
test environment
The SGI UV300 system has 8 sockets of 18-core Intel Xeon E7-8867 V4
processors and 16 × 32 GB DDR4 DRAM running at 1600 MHz.
It has 4 TB of DRAM in Non-Uniform Memory Access (NUMA) via SGI’s
NUMALink technology.
4x PCIe Intel flash cards, 8 TB in total, are configured as a RAID 0
device and mounted as “/scratch”, with 975.22 MB/s achieved I/O
bandwidth (measured with IOR).
2x 400 GB Intel SSDs are configured as a RAID 1 device and mounted as
“/home”, with 370.07 MB/s achieved I/O bandwidth.
The software used was GCC-4.4.7 and CUDA-7.5, along with the MPI
package MPICH-3.2.
4. Memory Buffer Approach and I/O Analysis
use CPU memory as a buffer
• Most GPU-enabled computers have more CPU memory than GPU memory (in our case 48 GB << 4 TB).
• Fixed data chunk size (120 days’ worth) with different total workloads.
• On the “/scratch” partition, for every data size, the buffering technique costs more overall runtime than no buffering.
• On the “/home” partition, buffering starts to help once the total workload reaches 2400 days.
• On a high I/O bandwidth filesystem, the improvement brought by buffering cannot cover the overhead of the memory transfer.
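The trade-off in the bullets above can be sketched in a few lines: the buffered path writes to storage once but pays for an extra copy into host memory, while the direct path writes every chunk as soon as it is ready. This is a plain Python illustration with hypothetical function names and toy data, not the cuNCC implementation, which additionally involves GPU-to-host transfers.

```python
import io
import os
import tempfile
from array import array

def write_direct(path, chunks):
    """Write each chunk's coefficients to storage as soon as they are produced."""
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(array("f", chunk).tobytes())

def write_buffered(path, chunks):
    """Collect all coefficients in host memory, then write them in one call."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(array("f", chunk).tobytes())  # the extra memory copy
    with open(path, "wb") as f:
        f.write(buf.getvalue())

# Hypothetical stand-in for per-chunk similarity coefficients.
chunks = [[0.1 * i + j for j in range(4)] for i in range(3)]
with tempfile.TemporaryDirectory() as d:
    p1 = os.path.join(d, "direct.bin")
    p2 = os.path.join(d, "buffered.bin")
    write_direct(p1, chunks)
    write_buffered(p2, chunks)
    with open(p1, "rb") as a, open(p2, "rb") as b:
        same = a.read() == b.read()
    size = os.path.getsize(p1)
print("identical output:", same, "bytes:", size)
```

Both paths produce the same binary file; only the number of storage accesses and the amount of copying differ, which is why the faster filesystem gains nothing from buffering.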
4. Memory Buffer Approach and I/O Analysis
use a shared memory virtual filesystem as a buffer
• We set up 2 TB of DRAM as a shared memory virtual filesystem and measured an achieved I/O bandwidth of 2228.05 MB/s.
• On the “/dev/shm” partition, the high bandwidth of shared memory improves performance greatly by reducing the time used for output.
• We gathered the runtime results without the buffering scheme from all three storage partitions; the shared memory partition obtains the best performance.
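A quick way to compare candidate output locations is to time a sequential write in each. The sketch below is a much simplified stand-in for a benchmark such as IOR, not a replacement for it; the function names, the 8 MB test size, and the fallback to the system temporary directory are all assumptions.

```python
import os
import tempfile
import time

def write_bandwidth_mb_s(directory, total_mb=8, block_mb=1):
    """Time a sequential write of total_mb megabytes and return MB/s."""
    block = b"\0" * (block_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(total_mb // block_mb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time to reach the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return total_mb / elapsed

def pick_output_dir(candidates=("/dev/shm", tempfile.gettempdir())):
    """Return the first candidate directory that exists and is writable."""
    for d in candidates:
        if os.path.isdir(d) and os.access(d, os.W_OK):
            return d
    raise RuntimeError("no writable output directory found")

out_dir = pick_output_dir()
bw = write_bandwidth_mb_s(out_dir)
print(f"{out_dir}: {bw:.1f} MB/s")
```

A test this small mostly measures caches, so the numbers are only useful for a coarse ranking of partitions on the same machine.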
4. Memory Buffer Approach and I/O Analysis
I/O test conclusion
• For machines that support a shared memory virtual filesystem, we
recommend using the shared memory as a buffer for the cuNCC output,
especially when the similarity coefficients are an intermediate
result for subsequent computation.
• For machines that do not have shared memory but have a high-bandwidth
I/O device, we recommend writing the result directly to storage
without the buffering scheme.
• For machines that do not support shared memory and have a low-bandwidth
I/O device, we should consider using CPU memory as a
buffer to reduce disk access frequency.
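The three recommendations can be written as a small decision function. The function name, the boolean inputs, and the returned labels are hypothetical; the slides do not state a bandwidth threshold separating "high" from "low", so that judgment is left to the caller.

```python
def recommend_output_strategy(has_shared_memory_fs, high_bandwidth_storage):
    """Map the two machine properties to the output strategy recommended above."""
    if has_shared_memory_fs:
        return "write to shared memory filesystem"
    if high_bandwidth_storage:
        return "write directly to storage"
    return "buffer in CPU memory"

# Example: no shared memory filesystem, slow storage device.
print(recommend_output_strategy(False, False))
```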
5. Future Work
• Further optimize the cuNCC code on the Pascal GPU platform.
• Integrate the cuNCC code with the “SEISM-IO” library, whose interface
allows users to switch among “MPI-IO”, “PHDF5”, “NETCDF4”, and
“ADIOS” as low-level I/O libraries.
Thank you for your time!

A Buffering Approach to Manage I/O in a Normalized Cross-Correlation Earthquake Detection Code for Large Seismic Datasets

  • 1. A Buffering Approach to Manage I/O in a Normalized Cross-Correlation Earthquake Detection Code for Large Seismic Datasets Dawei Mu, Pietro Cicotti, Yifeng Cui, Enjui Lee, Po Chen
  • 2. Outlines 1. Introduction of cuNCC code 2. Realistic Application 3. Performance Analysis 4. Memory Buffer Approach and I/O Analysis 5. Future Work
  • 3. 1. Introduction of cuNCC what is cuNCC ? CUDA based software designed to calculate the normalized cross- correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments to evaluate the waveform similarity among the waveforms and/or the relative travel-time differences. Feb 05, 2016 M6.6
 Meinong aftershocks detection • more uncatalogued aftershocks
 were detected • contributed to earthquake 
 location detection and earthquake 
 source parameters estimation
  • 4. 2. Realistic Application M6.6 Meinong aftershocks 
 hypocenter re-location. •traditional method using short term 
 long term to detect events and using 
 1-D model to locate hypocenter. fewer
 aftershock detected, the result contains
 less information due to inaccuracy. •3-D waveform template matching detect
 events, using 3D model and waveform 
 travel-time differences to re-locate the 
 hypocenter, the result shows more 
 events and more clustered hypocenters, 
 which give us detailed fault geometry •over 4 trillion NCC calculations involved
  • 5. 3. Performance Analysis: optimization scheme
    • cuNCC is bound by memory bandwidth.
    • Constant memory is used to stack multiple events into a single computational kernel.
    • Shared memory is used to improve memory bandwidth utilization.
    Profiler screenshot text (truncated in the slide): 1. Compute, Bandwidth, or Latency Bound. The first step in analyzing an individual kernel is to determine if the performance of the kernel is bounded by com[...] bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "cuNCC_04[...] limited by memory bandwidth. You should first examine the information in the "Memory Bandwidth" section to d[...] limiting performance. 1.1. Kernel Performance Is Bound By Memory Bandwidth. For device "GeForce GTX 980" the kernel's compute utilization is significantly lower than its memory utilization[...] levels indicate that the performance of the kernel is most likely being limited by the memory system. For this kern[...] factor in the memory system is the bandwidth of the Shared memory.
  • 6. 3. Performance Analysis: performance benchmark
    cuNCC achieves high performance without high-end hardware or expensive clusters: for the example mentioned above, an optimized CPU-based NCC code needs 21 hours on one E7-8867 CPU (all 18 cores), while a single NVIDIA GTX980 needs only 53 minutes.

    Hardware            Runtime (ms)  SP FLOP (×1e11)  Achieved GFLOPS  Max GFLOPS  Achieved Percentage  Speedup
    E7-8867 (18 cores)  2968          1.23             41.36            237.6       17.4%                1.0x
    C2075 (Fermi)       495           1.8              363.83           1030        35.3%                6.0x
    GTX980 (Maxwell)    116           1.8              1552.80          5000        31.0%                25.6x
    M40 (Maxwell)       115           1.8              1569.86          7000        22.4%                25.8x
    P100 (Pascal)       62            1.8              2911.84          10600       27.5%                47.9x
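The derived columns of the benchmark table follow from the runtime, the FLOP count, and the peak rate. The sketch below recomputes them from the figures printed on the slide; the names `BENCH`, `achieved_gflops`, and `summarize` are hypothetical, and because the slide rounds the FLOP count to two or three digits, the recomputed GFLOPS and percentages agree with the table only approximately.

```python
# (runtime in ms, single-precision FLOP count, peak GFLOPS), copied from the slide.
BENCH = {
    "E7-8867": (2968, 1.23e11, 237.6),
    "C2075": (495, 1.8e11, 1030.0),
    "GTX980": (116, 1.8e11, 5000.0),
    "M40": (115, 1.8e11, 7000.0),
    "P100": (62, 1.8e11, 10600.0),
}


def achieved_gflops(runtime_ms, flop):
    """FLOP count divided by runtime in seconds, expressed in GFLOPS."""
    return flop / (runtime_ms * 1e-3) / 1e9


def summarize(bench, baseline="E7-8867"):
    """Map hardware name -> (achieved GFLOPS, percent of peak, speedup vs. baseline)."""
    base_ms = bench[baseline][0]
    rows = {}
    for name, (ms, flop, peak) in bench.items():
        gflops = achieved_gflops(ms, flop)
        rows[name] = (gflops, 100.0 * gflops / peak, base_ms / ms)
    return rows


if __name__ == "__main__":
    for name, (gflops, pct, speedup) in summarize(BENCH).items():
        print(f"{name:8s} {gflops:9.2f} GFLOPS  {pct:5.1f}% of peak  {speedup:5.1f}x")
```

Speedup here is simply the ratio of the CPU runtime to the device runtime, so it does not depend on the rounded FLOP counts.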
  • 7. 4. Memory Buffer Approach and I/O Analysis: the I/O bottleneck
    After GPU acceleration improved the computational performance, I/O efficiency became the new bottleneck of cuNCC's overall performance. The output file of cuNCC is a 1-D vector of similarity coefficients saved in binary format, whose size equals that of the seismic data file. In the CPU NCC code, I/O operations cost roughly 10% of the total runtime, while in the GPU code I/O costs more than 75% of the total runtime.
    [Bar chart: I/O versus compute time for NCC (CPU) and cuNCC; axis from 0 to 500]
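A back-of-the-envelope calculation shows why the I/O share grows once compute is accelerated. The sketch assumes, as a simplification that the slides do not state, that the absolute I/O time stays the same while only the compute time shrinks; the function name is hypothetical.

```python
def io_fraction_after_speedup(io_fraction, compute_speedup):
    """I/O share of total runtime after compute is sped up and I/O is unchanged.

    io_fraction: I/O share of the original runtime (0..1).
    compute_speedup: factor by which the compute part becomes faster (> 0).
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if compute_speedup <= 0.0:
        raise ValueError("compute_speedup must be positive")
    io_time = io_fraction                               # unchanged (assumption)
    compute_time = (1.0 - io_fraction) / compute_speedup
    return io_time / (io_time + compute_time)


if __name__ == "__main__":
    # 10% I/O on the CPU and the 25.6x GTX980 speedup from the benchmark table.
    print(f"{io_fraction_after_speedup(0.10, 25.6):.1%}")
```

With a 10% I/O share and a 25.6x compute speedup, this estimate lands near three quarters of the runtime, the same ballpark as the share reported on the slide. It is an illustration under the stated assumption, not the authors' measurement.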
  • 8. 4. Memory Buffer Approach and I/O Analysis: test environment
    • The SGI UV300 system has 8 sockets of 18-core Intel Xeon E7-8867 V4 processors, each with 16 DDR4 32 GB DRAM modules running at 1600 MHz, for 4 TB of DRAM in Non-Uniform Memory Access (NUMA) via SGI's NUMALink technology.
    • 4x PCIe Intel flash cards, 8 TB in total, configured as a RAID 0 device and mounted as "/scratch", with 975.22 MB/s achieved I/O bandwidth (measured with IOR).
    • 2x 400 GB Intel SSDs configured as a RAID 1 device and mounted as "/home", with 370.07 MB/s achieved I/O bandwidth.
    • Software: GCC-4.4.7 and CUDA-7.5, along with the MPI package MPICH-3.2.
  • 9. 4. Memory Buffer Approach and I/O Analysis: use CPU memory as a buffer
    • Most GPU-enabled computers have more CPU memory than GPU memory (in our case 48 GB << 4 TB).
    • Fixed data chunk size (120 days of data) with different total workloads.
    • On the "/scratch" partition, for every data size, the buffering technique costs more overall runtime than no buffering.
    • On the "/home" partition, buffering starts to help once the total workload reaches 2400 days.
    • On the high-I/O-bandwidth filesystem, the improvement brought by buffering cannot cover the overhead of the memory transfer.
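The idea of using CPU memory as a buffer can be sketched as follows: result chunks are accumulated in host memory and written to disk in fewer, larger operations. This is a minimal illustration and not the cuNCC implementation; the class name `BufferedResultWriter`, the `flush_bytes` threshold, and the little-endian float32 layout are all assumptions.

```python
import struct


class BufferedResultWriter:
    """Accumulate float32 result chunks in host memory and flush in large writes."""

    def __init__(self, path, flush_bytes):
        self._file = open(path, "wb")
        self._flush_bytes = flush_bytes   # hypothetical threshold, tuned per system
        self._buffer = bytearray()
        self.write_calls = 0              # number of writes issued to the file

    def append(self, coefficients):
        """Add one chunk of coefficients; flush once the buffer is large enough."""
        self._buffer += struct.pack(f"<{len(coefficients)}f", *coefficients)
        if len(self._buffer) >= self._flush_bytes:
            self.flush()

    def flush(self):
        """Write the buffered bytes to the file in a single call."""
        if self._buffer:
            self._file.write(self._buffer)
            self.write_calls += 1
            self._buffer.clear()

    def close(self):
        """Flush whatever is left and close the file."""
        self.flush()
        self._file.close()
```

As the slide notes, the extra copy through host memory has a cost of its own, so on a fast filesystem this kind of buffering may not pay off.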
  • 10. 4. Memory Buffer Approach and I/O Analysis: use a shared memory virtual filesystem as a buffer
    • We set up 2 TB of DRAM as a shared memory virtual filesystem and measured an achieved I/O bandwidth of 2228.05 MB/s.
    • On the "/dev/shm" partition, the high bandwidth of shared memory improves performance greatly by reducing the time spent on output.
    • We gathered the runtime results without the buffering scheme from all three storage partitions; the shared memory partition gives the best performance.
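To compare candidate output locations such as "/dev/shm", "/scratch", and "/home" on another machine, a rough write-throughput check can be sketched as below. The slides report bandwidths measured with IOR; this timing loop is a much cruder stand-in, and the function names and default sizes are hypothetical.

```python
import os
import tempfile
import time


def measure_write_bandwidth(directory, total_bytes=8 * 1024 * 1024,
                            block_bytes=1024 * 1024):
    """Return sequential write throughput in MB/s for a temp file in `directory`."""
    block = b"\0" * block_bytes
    fd, path = tempfile.mkstemp(dir=directory)
    written = 0
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            while written < total_bytes:
                f.write(block)
                written += block_bytes
            f.flush()
            os.fsync(f.fileno())   # include the time to push data to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return written / 1e6 / max(elapsed, 1e-9)


def pick_output_dir(candidates):
    """Return the first candidate that is an existing, writable directory."""
    for directory in candidates:
        if os.path.isdir(directory) and os.access(directory, os.W_OK):
            return directory
    raise FileNotFoundError("no writable output directory among candidates")
```

A call such as `pick_output_dir(["/dev/shm", "/scratch", "/home"])` would prefer the shared memory filesystem when it exists. Results written there live in DRAM, so they still need to be copied to persistent storage if they must survive a reboot.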
  • 11. 4. Memory Buffer Approach and I/O Analysis: I/O test conclusion
    • For machines that support a shared memory virtual filesystem, we recommend using shared memory as the buffer for cuNCC output, especially when the similarity coefficients are an intermediate result for subsequent computation.
    • For machines that do not support shared memory but have a high-bandwidth I/O device, we recommend writing the result directly to storage without the buffering scheme.
    • For machines that do not support shared memory and have a low-bandwidth I/O device, we should consider using CPU memory as a buffer to reduce disk access frequency.
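The three recommendations can be written as a small decision function. This is only a restatement of the slide under one assumption: the slides give no numeric cutoff between a low-bandwidth and a high-bandwidth device, so the threshold is a hypothetical parameter the caller must choose, and the function name and return strings are likewise illustrative.

```python
def recommend_output_strategy(has_shm_filesystem, storage_bandwidth_mb_s,
                              high_bandwidth_threshold_mb_s):
    """Pick an output strategy following the three cases in the conclusion.

    high_bandwidth_threshold_mb_s is an assumption: the slides do not define
    where "high bandwidth" begins.
    """
    if has_shm_filesystem:
        return "shared-memory filesystem"
    if storage_bandwidth_mb_s >= high_bandwidth_threshold_mb_s:
        return "direct write"
    return "CPU memory buffer"


if __name__ == "__main__":
    # Bandwidths from the test environment; the 500 MB/s cutoff is made up.
    print(recommend_output_strategy(True, 975.22, 500.0))
    print(recommend_output_strategy(False, 975.22, 500.0))
    print(recommend_output_strategy(False, 370.07, 500.0))
```

The slides also show that on the slower partition buffering helped only for large total workloads, so a fuller version would probably take the workload size into account as well.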
  • 12. 5. Future Work
    • Further optimize the cuNCC code on the Pascal GPU platform.
    • Integrate cuNCC with the "SEISM-IO" library, whose interface allows users to switch among "MPI-IO", "PHDF5", "NETCDF4", and "ADIOS" as low-level I/O libraries.
  • 13. Thank you for your time!