Exploring Parallel Merging Technique in GPU Based System
Presented By
Twasif F. Rahman (ID: 2011-3-60-005)
&
Md. Rakib Bahadur (ID: 2011-3-60-025)
Objectives
• Understand the GPU architecture by implementing the Adaptive Merge Sort algorithm on it
• Minimize execution time
• Draw up a standard design so that oversized data does not add overhead to the total execution time in the long run
– Reduce communication overhead (CPU <-> GPU)
What is GPGPU
• General-purpose computing on graphics processing units (GPGPU)
• Very high performance at low cost
– 30-100X speedup over CPU
• Architecture well suited to many parallel applications
– data-parallel processing (SIMD/SPMD)
• Integrated programmable unit
Architecture of GPU - GTX 650
[Figure: block diagram showing arrays of SP, LD/ST and SFU units]
• SP, or streaming processor (core): 192
• LD/ST: load/store units
• SFU: special function units, which handle cos(), sin(), log(), exp(), sqrt()
Architecture of GPU- GTX 650
• Registers hold values; 65536 at most, with fast access
• The warp scheduler groups threads into warps and dispatches their instructions to the SPs through the dispatch unit
• Shared memory / L1 cache (read-only)
Architecture of GPU- GTX 650
Application: Adaptive Merge Sort [1]
• Works in three (3) steps:
– Partition the data set into sub-lists, or nodes, based on order
– Rearrange every node into ascending order
– Merge all nodes
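The three steps above can be sketched serially in Python. This is a minimal illustration and not the authors' implementation: the function names `partition_runs`, `make_ascending` and `merge_all` are hypothetical, ties are assumed to count as ascending, and the standard-library `heapq.merge` stands in for the pairwise merge.

```python
from heapq import merge


def partition_runs(data):
    """Step 1: split data into maximal ascending or strictly descending runs ("nodes")."""
    nodes, start = [], 0
    while start < len(data):
        end = start + 1
        if end < len(data) and data[end] < data[start]:
            # Strictly descending run.
            while end < len(data) and data[end] < data[end - 1]:
                end += 1
        else:
            # Ascending run (assumption: equal neighbours count as ascending).
            while end < len(data) and data[end] >= data[end - 1]:
                end += 1
        nodes.append(data[start:end])
        start = end
    return nodes


def make_ascending(nodes):
    """Step 2: reverse every descending node so that all nodes are ascending."""
    return [node[::-1] if node[0] > node[-1] else node for node in nodes]


def merge_all(nodes):
    """Step 3: merge neighbouring nodes level by level until one node remains."""
    levels = 0
    while len(nodes) > 1:
        nodes = [list(merge(*nodes[i:i + 2])) for i in range(0, len(nodes), 2)]
        levels += 1
    return (nodes[0] if nodes else []), levels


data = [7, 3, 4, 6, 9, 15, 13, 9, 0, 2, 2, 3]
nodes = partition_runs(data)
ascending = make_ascending(nodes)
result, levels = merge_all(ascending)
print(nodes)
print(ascending)
print(result, levels)
```

The 12-item input is the small example used in the deck's partitioning slides; it splits into four nodes and needs two levels of merging.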
Serial Implementation
• Experiment with N = 7,168 random numbers
• After partitioning:
– P = 2968 nodes in total, of which 1491 are in descending order
– All descending nodes are converted to ascending order
• One step of merging two nodes then proceeds in three (3) steps:
– determine the node in which the selected value resides
– determine its position in the newly merged node
– update the old nodes' information once the newly merged node is introduced
• The merging process repeats for every data item on each level of merging until all fragments have converged into one.
How Many Levels
Merging recurs once per level of the merge tree. With 2968 nodes, the height is taken as 11 (log2 2968 ≈ 11.5).
So the merge function would be called 11 × 7168 = 78,848 times.
Total calculation time = 0.161 sec.
Time for one execution of the merge function = 0.161 sec / 78,848 ≈ 2.04 µs (theoretical).
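The arithmetic on this slide can be reproduced with a few lines of Python. The variable names are illustrative only. Note that the sketch also prints log2(2968): the value is about 11.5, so a full pairwise merge tree would have 12 levels if rounded up, while the slide works with 11.

```python
import math

n_items = 7168          # data items, from the slide
n_nodes = 2968          # nodes after partitioning, from the slide
total_time_s = 0.161    # measured serial time, from the slide
levels_used = 11        # tree height as stated on the slide

calls = levels_used * n_items
time_per_call_us = total_time_s / calls * 1e6

print(calls)                        # number of merge-function calls
print(round(time_per_call_us, 2))   # estimated microseconds per call
print(math.log2(n_nodes))           # exact height before rounding
```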
Bottleneck
Q: What happens when billions of data items in millions of nodes have to be merged?
Answer: The merge function has to be called billions of times and would need hours or more to compute.
Necessity of parallel computation
Execute multiple merging operations at the same time (in parallel) to reduce the time consumed.
Implementation In GPU
– Threads
– Blocks and Grids
– Kernel
Blocks and Grids
Implementation In GPU
Kernel
• CUDA C allows the programmer to define C functions, called kernels, that execute as many times as the number of threads specified when they are called from a host or device function.
– A kernel function is defined with the __global__ declaration specifier
– The number of threads and blocks is specified inside the "<<<…>>>" execution configuration syntax
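How threads, blocks and the grid map onto data can be mimicked in plain Python. The sketch below is a serial stand-in, not CUDA: `launch` and `double_kernel` are hypothetical names, the two loops play the role of a `<<<grid_size, block_size>>>` launch, and the index arithmetic mirrors the usual CUDA C expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def launch(kernel, grid_size, block_size, *args):
    """Serial stand-in for a kernel launch with grid_size blocks of block_size threads."""
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            kernel(block_idx, block_size, thread_idx, *args)


def double_kernel(block_idx, block_dim, thread_idx, data, n):
    """Body executed once per thread: each thread doubles one element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:  # guard: the grid may hold more threads than data items
        data[i] *= 2


data = list(range(10))
launch(double_kernel, 3, 4, data, len(data))  # 3 blocks x 4 threads = 12 threads for 10 items
print(data)
```

On a GPU the threads run concurrently rather than in loop order, which is why each thread must only touch the element selected by its own index.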
Partitioning nodes
Index:  0  1 |  2  3  4  5 |  6  7  8 |  9 10 11
Value:  7  3 |  4  6  9 15 | 13  9  0 |  2  2  3
Node:  Node 0 |   Node 1   |  Node 2  |  Node 3
Reversing Descending nodes
Node 0 (7 3) and Node 2 (13 9 0) are in descending order. THREAD 0 handles Node 0 and THREAD 1 handles Node 2.
Thread 0
Index:  0  1 |  2  3  4  5 |  6  7  8 |  9 10 11
Value:  3  7 |  4  6  9 15 |  0  9 13 |  2  2  3
Node:  Node 0 |   Node 1   |  Node 2  |  Node 3
• Thread 0 handles the data item at index 0 (value 3, in Node 0).
• New position = position of the item in its own node + number of items in the high node (Node 1) that are <= the item
• Node 0 and Node 1 merge into a new Node 0 that occupies positions 0 1 2 3 4 5.
Thread 2
• Thread 2 handles the data item at index 2 (value 4, in Node 1).
• New position = position of the item in its own node + number of items in the other node (Node 0) that are less than the item
• The item is written into the same merged Node 0 (positions 0 1 2 3 4 5).
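The position rule from the Thread 0 and Thread 2 slides can be sketched as follows. This is a serial Python illustration under stated assumptions: `rank_merge` is a hypothetical name, each loop iteration stands for one GPU thread, and the counts are obtained with the standard-library `bisect` functions (the deck does not say how the authors count). Items of the low node count high-node items that are <= themselves, and items of the high node count low-node items that are strictly smaller, so equal values never collide.

```python
from bisect import bisect_left, bisect_right


def rank_merge(low, high):
    """Merge two ascending nodes by computing every item's final position independently."""
    merged = [None] * (len(low) + len(high))
    # Low-node items: own position + number of high-node items <= item.
    for pos, item in enumerate(low):
        merged[pos + bisect_right(high, item)] = item
    # High-node items: own position + number of low-node items < item.
    for pos, item in enumerate(high):
        merged[pos + bisect_left(low, item)] = item
    return merged


# First level: Node 0 with Node 1, and Node 2 with Node 3.
first_level = [rank_merge([3, 7], [4, 6, 9, 15]), rank_merge([0, 9, 13], [2, 2, 3])]
# Second level: the two merged nodes.
second_level = rank_merge(first_level[0], first_level[1])
print(first_level)
print(second_level)
```

Because no item's position depends on another item having been placed first, all items of a level can be processed at once, which is what makes the rule suitable for one thread per data item.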
Parallel implementation
After all 12 threads run in parallel (after 1st-level merging):
Index:  0  1  2  3  4  5 |  6  7  8  9 10 11
Value:  3  4  6  7  9 15 |  0  2  2  3  9 13
Node:        Node 0      |       Node 1
After 2nd-level merging:
Index:  0  1  2  3  4  5  6  7  8  9 10 11
Value:  0  2  2  3  3  4  6  7  9  9 13 15
Node:                Node 0
GPU vs CPU time comparison
Data     GPU        CPU
4096     0.006472   0.031
7168     0.013      0.21
10240    0.0258     0.205
13312    0.0298     0.345
[Charts: CPU and GPU time for 4096, 7168, 10240 and 13312 data items, as absolute values and as percentage shares]
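The speedup implied by the comparison table can be worked out directly. A short sketch, assuming both columns are given in the same time unit (the slide does not state one); `timings` and `speedups` are illustrative names.

```python
# (data size) -> (GPU time, CPU time), copied from the comparison table
timings = {
    4096: (0.006472, 0.031),
    7168: (0.013, 0.21),
    10240: (0.0258, 0.205),
    13312: (0.0298, 0.345),
}

# Speedup = CPU time / GPU time for each data size.
speedups = {n: cpu / gpu for n, (gpu, cpu) in timings.items()}
for n, s in speedups.items():
    print(n, round(s, 1))
```

These ratios cover kernel execution only; they do not include the CPU <-> GPU memory copy time reported separately in the deck.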
Static Gridsize and Variable Blocksize
Execution time is nearly the same for a fixed Gridsize and a varying Blocksize (max 1024).
Variable Gridsize and Static Blocksize
[Chart: time of execution (microseconds) for 1024 X 4, 1024 X 8, 1024 X 16, 1024 X 32 and 1024 X 64]
Avg time per iteration = 1.011 microseconds
A Best Fit Situation
A Worst Fit Situation
Best Fit vs. Worst Fit
Execution Time Analysis
• Parallel Block # (PB) = Max threads per block / user-defined BLOCKSIZE
• PB <= 16
– Parallel Code Loop # = user-defined GRIDSIZE / (PB × # of SMX)
• PB > 16
– Parallel Code Loop # = user-defined GRIDSIZE / (16 × # of SMX)
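The loop-count rule above can be checked against the iteration counts listed in the Theory vs. Reality table. The sketch below rests on assumptions: the table's "A*B" entries are read as BLOCKSIZE*GRIDSIZE, the maximum of 1024 threads per block is taken from the earlier slide, and the number of SMX units is set to 2 because that value reproduces the deck's iteration counts. `parallel_code_loops` is a hypothetical name.

```python
MAX_THREADS_PER_BLOCK = 1024  # "max 1024" quoted in the deck
MAX_PARALLEL_BLOCKS = 16      # cap used in the PB > 16 case
NUM_SMX = 2                   # assumption: matches the deck's iteration counts


def parallel_code_loops(block_size, grid_size):
    """Number of sequential passes needed to run grid_size blocks."""
    pb = MAX_THREADS_PER_BLOCK // block_size
    blocks_per_pass = min(pb, MAX_PARALLEL_BLOCKS) * NUM_SMX
    return -(-grid_size // blocks_per_pass)  # ceiling division


# (BLOCKSIZE, GRIDSIZE) pairs as read from the table
configs = [(1024, 4), (512, 8), (256, 16), (128, 32), (64, 64),
           (32, 128), (16, 256), (8, 512), (4, 1024), (2, 2048)]
loops = [parallel_code_loops(b, g) for b, g in configs]
print(loops)
```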
Theory vs. Reality
DATA/Iterations      Theoretical Execution Time   Actual Execution Time   Error
1024*4 (2 iter)      2.022                        3.3                     0.38
512*8 (2 iter)       2.022                        3.3                     0.38
256*16 (2 iter)      2.022                        3.3                     0.38
128*32 (2 iter)      2.022                        3.3                     0.38
64*64 (2 iter)       2.022                        4.192                   0.52
32*128 (4 iter)      4.044                        5.631                   0.28
16*256 (8 iter)      8.088                        8.652                   0.065
8*512 (16 iter)      16.176                       14.8034                 0.0927
4*1024 (32 iter)     32.352                       26.8096                 0.2
2*2048 (64 iter)     64.704                       50.483                  0.28
[Chart: theoretical execution time, actual execution time and error for 2, 2, 2, 2, 2, 4, 8, 16, 32 and 64 iterations]
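The table's columns appear to follow two simple relations, which the sketch below recomputes. Both are inferences from the numbers rather than statements in the deck: the theoretical time is taken as iterations × 1.011 (the average time per iteration quoted earlier), and the error as |actual − theoretical| / actual. The names are illustrative.

```python
AVG_TIME_PER_ITERATION = 1.011  # average time per iteration quoted in the deck

# (iterations, actual time, error as printed in the table)
rows = [(2, 3.3, 0.38), (2, 4.192, 0.52), (4, 5.631, 0.28), (8, 8.652, 0.065),
        (16, 14.8034, 0.0927), (32, 26.8096, 0.2), (64, 50.483, 0.28)]

recomputed = []
for iterations, actual, printed_error in rows:
    theoretical = iterations * AVG_TIME_PER_ITERATION
    error = abs(actual - theoretical) / actual
    recomputed.append((theoretical, error))
    print(iterations, round(theoretical, 3), round(error, 3), printed_error)
```

The recomputed errors agree with the printed ones to within rounding, which suggests the deck truncated or rounded some entries.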
GPU Exec Time vs. Mem Transfer Time
Data    Execution Time   Memory copy Time
4K      0.006472         0.0725
7K      0.013            0.0825
10K     0.0258           0.0975
13K     0.0298           0.125
[Charts: execution time and memory copy time for 4K, 7K, 10K and 13K data items, as absolute values and as percentage shares]
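The share of total GPU time spent on memory copies follows from the table. A short sketch, assuming both columns use the same time unit and that the total is simply execution time plus memory copy time; `measurements` and `copy_share` are illustrative names.

```python
# label -> (execution time, memory copy time), copied from the table
measurements = {
    "4K": (0.006472, 0.0725),
    "7K": (0.013, 0.0825),
    "10K": (0.0258, 0.0975),
    "13K": (0.0298, 0.125),
}

# Fraction of (execution + copy) time that is spent copying memory.
copy_share = {label: copy / (execute + copy)
              for label, (execute, copy) in measurements.items()}
for label, share in copy_share.items():
    print(label, f"{share:.0%}")
```

Under these assumptions the memory copy accounts for most of the total time at every data size, which is consistent with the deck's stated objective of reducing CPU <-> GPU communication overhead.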
Conclusion
• A successful investigation
• The GPU's computational power should be harnessed to solve more merging problems
• The examples and the design should be followed to gain the upper hand before a problem is approached
Future Work
• Interpolation Merge Sort
• More efficiency and better memory handling in updated work
• Grid-level parallelism (requires multiple GPUs)
References
• [1] Shamim Akhter et al., 2010. Sorting N-Elements Using Natural Order: A New Adaptive Sorting Algorithm. Journal of Computer Science 6 (2): 163-167.
THANK YOU FOR
YOUR PATIENCE
Q/A
