
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang


In this session we present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries, and because an FPGA is more power-efficient than a CPU, it can lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we decompose a complex SQL query into basic operations, and each operation is fed into an Engine Unit according to its pattern. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries run with the FPGA-based Spark SQL acceleration architecture on Xeon E5 plus FPGA against Spark SQL queries on Xeon E5 alone, with 10X ~ 100X improvement, and we will demonstrate one SQL query workload from a real customer.



  1. FPGA-BASED ACCELERATION ARCHITECTURE FOR SPARK SQL
     Qi Xie (qi.xie@intel.com), Hao Cheng (hao.cheng@intel.com), Quanfu Wang (quanfu.wang@intel.com)
  2. LEGAL NOTICES
     • You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
     • No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
     • Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
     • This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
     • The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
     • Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
     • Intel, the Intel logo, Intel® are trademarks of Intel Corporation in the U.S. and/or other countries.
     • *Other names and brands may be claimed as the property of others.
     • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
     • Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
     • Copyright © 2017 Intel Corporation.
  3. About me
     • Software engineer from Intel Big Data Engineering Spark team
     • Focused on Spark optimization for Intel Architecture
  4. Outline
     • What’s an FPGA
     • Intel FPGA Platform
     • Workload & Benchmark Introduction
     • Baseline Profile - Hotspot Analysis
     • FPGA Acceleration Arch Overview
     • Performance Comparison
     • Future Works
  6. What is an FPGA?
     • Field Programmable Gate Array
       ‒ Configurable Logic Blocks (CLB)
       ‒ Embedded Memory
       ‒ Digital signal processing (DSP) blocks
       ‒ I/O pads
       ‒ Hard IP (PCIe, DDR, GigE, etc.)
  7. Why FPGA?
     Required function: y = f(a, b, c), defined by the truth table below.
       a b c | y
       0 0 0 | 1
       0 0 1 | 0
       0 1 0 | 1
       0 1 1 | 1
       1 0 0 | 1
       1 0 1 | 0
       1 1 0 | 1
       1 1 1 | 1
     Programmed LUT contents: 1 0 1 1 1 0 1 1. A MUX addressed by (a, b, c) selects the output bit y.
     ‒ Reconfigurable architecture: a CLB consists of LUTs. A LUT is a RAM with a data width of 1 bit, and its contents are programmed at power-up.
     ‒ Low power and high energy efficiency compared with CPU/GPU: an extreme degree of customization, well positioned for high performance while remaining flexible.
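
To make the LUT idea on this slide concrete, here is a minimal software sketch of a 3-input LUT. It is only an illustration: the object and method names (`LutSketch`, `lookup`, `program`) are hypothetical, and the choice of `a` as the most significant address bit is an assumption that matches the row order of the truth table.

```scala
object LutSketch {
  // Programmed LUT contents from the slide, indexed by the 3-bit address
  // (a, b, c) with a as the most significant bit: 000 -> 1, 001 -> 0, ...
  val lut: Array[Int] = Array(1, 0, 1, 1, 1, 0, 1, 1)

  // The MUX selects one stored bit, using the inputs as the address.
  def lookup(a: Int, b: Int, c: Int): Int = lut((a << 2) | (b << 1) | c)

  // "Reprogramming" the LUT amounts to loading a different 8-entry table.
  def program(f: (Boolean, Boolean, Boolean) => Boolean): Array[Int] =
    Array.tabulate(8) { i =>
      if (f((i & 4) != 0, (i & 2) != 0, (i & 1) != 0)) 1 else 0
    }
}
```

Reading the table row by row, the output is 0 only when b = 0 and c = 1, so `LutSketch.program((a, b, c) => b || !c)` reproduces the same eight bits. That Boolean form is derived from the table here; the slide's own formula did not survive extraction.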
  9. Discrete and Integrated FPGA platforms
  10. Intel Accelerator Abstraction Layer (AAL)
      (Diagram; recoverable labels: FPGA, Hardware)
  11. End User Programming Interfaces
      (Diagram of the CPU and FPGA sides; recoverable labels:)
      • CPU: User Application; FPGA Runtime Software (Accelerator Abstraction Layer)
      • FPGA: FPGA IP (Acceleration Function Unit); Infrastructure IP (UPI, PCIe*, HSSI, FPGA Management)
      • Interfaces: User Software Interface; Core Cache Interface; UPI/PCIe; HSSI
      • Legend: Intel-Provided Infrastructure; User Developed Application Specific Functions; new blocks that simplify code development
  12. Traditional FPGA Development Approach
      (Diagram of two development flows; recoverable labels:)
      • HDL Programming: HDL, Syn. PAR, AFU Bitstream; C, SW Compiler, exe; AFU Simulation Environment (ASE)
      • OpenCL Programming: Kernels, OpenCL Compiler, AFU Bitstream; Host, SW Compiler, exe; OpenCL Emulator
      • Tools: ASE from Intel; AAL from Intel; Altera® Quartus Prime Pro; OpenCL BSP
      • Runtime, in both flows: Application and AAL Software on the CPU; Blue Bitstream and Green Bitstream on the FPGA
  14. Workload Introduction
      The test case comes from a customer. It uses a SQL query to compute accounting summaries by USER_ID on a big table, and the query contains heavy expression evaluation.
      Accounting big table (1.6x10^8 rows, 38 columns):
        TIME_ID   MBUSER_ID      OPER_TID              SUM_TIMES  CHARGE1  …
        20140407  2700007679977  5B013363363w          3          0        …
        20140407  2704012998344  31011G13iG0           48         57180    …
        20140407  2704040114238  31Q11512ZT0           1          180      …
        20140407  2700007012466  31011G13iG0           8          52320    …
        20140407  2700001523491  1T0311G80610ydH10G00  2          0        …
        20140407  2700000765632  310103015G0           1          30       …
        20140407  2700007800325  4562210021            1          0        …
        …
      • SQL queries summarize customers' consumption characteristics using billing data.
      • 5 GB in Parquet format stored on HDFS, 160 million rows.
  15. Workload code snippet
      Built-in function usage in the query:
        Function                                         Count
        Max                                              13
        Sum                                              155
        Substr                                           329
        Case                                             133
        Implicit data type cast (String to Double)       n/a
        Total                                            630

      // Prepare: load the Parquet input and register it as a temporary view
      val parquet = spark.read.parquet("/mnt/nvme/inputParquet/")
      parquet.createOrReplaceTempView("inputTable")
      // Query: a very long SQL statement that makes intensive use of built-in functions
  16. SQL Query Physical Execution Plan
      The plan has two stages separated by a shuffle (data exchanged across the network). The map stage contains the file scan, projection and partial aggregation, while the reduce stage does further aggregation by merging the partial aggregation results.
      • Stage 1 (Map)
        ‒ File Scan: reads data from the source.
        ‒ Projection: expression evaluation consumes most of the CPU cycles.
        ‒ Partial Aggregation: aggregates per partition.
      • Shuffle
      • Stage 2 (Reduce)
        ‒ Full Aggregation: a tiny task that consumes few CPU cycles.
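
The two-stage shape of this plan can be illustrated without Spark. The sketch below is plain Scala over in-memory sequences, not the customer query: the `Rec` type, its two fields and the use of a sum are assumptions chosen only to show partial aggregation per partition followed by a merge.

```scala
object TwoStageAggSketch {
  // Hypothetical record: only the two columns this sketch needs.
  final case class Rec(userId: String, charge: Double)

  // Stage 1 (map): project each row, then aggregate within one partition.
  def partialAgg(partition: Seq[Rec]): Map[String, Double] =
    partition
      .map(r => (r.userId, r.charge)) // stand-in for the projection step
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum }

  // Stage 2 (reduce): merge the partial results that arrive after the shuffle.
  def fullAgg(partials: Seq[Map[String, Double]]): Map[String, Double] =
    partials.flatten
      .groupBy(_._1)
      .map { case (k, vs) => k -> vs.map(_._2).sum }
}
```

In the real workload the projection is where the hundreds of Substr and Case evaluations happen, which is why the map stage dominates CPU time while the merge stays cheap.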
  18. Benchmark H/W Setup
      A single server is used for profiling and performance evaluation.
      • MCP (Skylake-FPGA Multiple Chips Package)
        o CPU: Intel Xeon Skylake-P, 2 sockets x 14 cores @ 2.8 GHz, 56 hyper-threads
        o FPGA: 1 x Arria10 GX, 427,200 ALM, 8 MB RAM (10AX115U3F45E2SG)
        o DMA channels: 1 x UPI (80 Gbps)
      • Memory: 384 GB, DDR4 @ 2133 MHz
      • Disk: 1 x Intel SSD P3700, 1.6 TB, SR: 2800 MB/s, SW: 1900 MB/s, RR: 450K IOPS, RW: 150K IOPS
  19. Baseline Profile - CPU, The Bottleneck
      • PAT (Performance Analysis Tool) shows that the CPU is heavily utilized (54 of 56 virtual cores assigned to Spark). The total query execution time is 85 seconds.
      • Note: measurement starts from the 2nd run (the 1st run warms up the Linux file system cache), so there is generally no disk access.
      • The reduce stage does very simple aggregation and takes little CPU time (~1s).
      *For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  20. Baseline Profile - CPU, The Bottleneck, Contd.
      • The VisualVM CPU breakdown of a map task shows that projection consumes 66.7% of the CPU.
      *For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  22. Arch Overview – Typical SQL Query Operators
      (Diagram of typical SQL query operators, with one marked "This PoC Target")
  23. Arch Overview - S/W Stack
      (Diagram; recoverable labels by layer:)
      • JVM: Spark; Spark FPGA Adaptor (InternalRow to FPGA Batch, FPGA Batch to InternalRow); FPGA Java Wrapper; FPGA Project
      • Native: Java Native Interface (JNI); FPGA Driver (Pattern Configure, DMA Configure, Huge Page Memory Pool, Computation Starter, Monitor); Accelerator Abstraction Layer (AAL)
      • HW: FPGA with DMA RX/TX (RTL) and SQL Engine (RTL)
      Components:
      • Spark FPGA Adaptor
        ‒ Identifies the expressions in the projection and exports them as FPGA SQL engine instructions.
        ‒ Converts data between Spark Internal Rows and FPGA Batches, in both directions.
      • FPGA Driver
        ‒ Configures the SQL Engine patterns according to the instructions from the Spark FPGA Adaptor.
        ‒ Triggers the FPGA computation and collects the results.
        ‒ Manages huge-page memory.
        ‒ Configures the DMA channel between main memory and the FPGA.
      • AAL
        ‒ FPGA runtime library.
        ‒ Low-level API for the FPGA Driver.
      • FPGA SQL Engine (RTL)
        ‒ SQL expression pattern units, which are configurable.
        ‒ DMA RX: the FPGA reads input data from main memory.
        ‒ DMA TX: the FPGA writes results to main memory.
  24. Arch Overview - S/W Stack, Contd.
      Data layouts (from the diagram):
      • Internal Row: null bit set (1 bit/field), values (8 bytes/field), variable length portion
      • FPGA Input Batch / FPGA Output Batch: fixed-width fields such as 4 bytes (TIME_ID), 8 bytes (MBUSER_ID), 12 bytes (ACC_NBR), ……, plus padding (64 bytes or 4 bytes in the diagram) for 4xCL alignment
      Steps (data flow and control flow through the Spark FPGA Adaptor, FPGA Java Wrapper and FPGA Project):
      1. Get a HugePage wrapped in a DirectByteBuffer
      2. Data conversion: Internal Rows to FPGAInputBatch
      3. Engine configuration, start
      4. Input for computation
      5. Collect the computation result
      6. Data conversion: FPGAOutputBatch to Internal Rows
      7. Free the HugePage wrapped in the DirectByteBuffer
      Terms:
      • Internal Row: Spark's representation of one record, flexible enough to represent fixed and variable length fields.
      • FPGA Input Batch: for memory and computation efficiency, fields are placed sequentially in physical memory.
      • FPGA Output Batch: similar to the FPGA Input Batch.
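
The row-to-batch conversion in step 2 can be sketched as packing fixed-width fields back to back into a direct buffer. This is a rough illustration under several assumptions, not the PoC's actual format: the `Row` type, the field order, little-endian byte order, zero padding of short strings, and padding the whole batch to a 64-byte cache line are all hypothetical, and a plain `allocateDirect` buffer stands in for the huge-page-backed `DirectByteBuffer`.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

object FpgaBatchSketch {
  val CacheLine = 64 // bytes; assumed alignment unit

  // Hypothetical row holding the three fields named on the slide.
  final case class Row(timeId: Int, mbUserId: Long, accNbr: String)

  val RowBytes: Int = 4 + 8 + 12 // field widths taken from the slide

  def alignUp(n: Int, a: Int): Int = (n + a - 1) / a * a

  // Pack rows sequentially, leaving zero padding up to a cache-line multiple.
  def pack(rows: Seq[Row]): ByteBuffer = {
    val size = alignUp(rows.length * RowBytes, CacheLine)
    // Direct (off-heap) memory, zero-initialised by the JVM.
    val buf = ByteBuffer.allocateDirect(size).order(ByteOrder.LITTLE_ENDIAN)
    rows.foreach { r =>
      buf.putInt(r.timeId)
      buf.putLong(r.mbUserId)
      val bytes = r.accNbr.getBytes(StandardCharsets.US_ASCII)
      buf.put(bytes, 0, math.min(bytes.length, 12)) // truncate long values
      (bytes.length until 12).foreach(_ => buf.put(0.toByte)) // pad short ones
    }
    buf.rewind() // position 0, limit stays at the padded capacity
    buf
  }
}
```

Keeping every field at a fixed offset is what lets a hardware pipeline address fields without parsing, which is the efficiency argument the slide makes for the batch format over Spark's Internal Row.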
  25. Arch Overview - Engine Pipeline, Data Flow
      (Diagram; recoverable labels: on the CPU, Data Source, Spark, FPGA Adapter & Driver, Input Buffers and Output Buffers; on the FPGA, DMA RX, rows of Engine Units forming a 169-level pipeline, and DMA TX. Data flow and control flow (pattern configure, computation control) are drawn separately.)
      • Engine Pipeline: the Spark FPGA SQL Engine is designed as pipelines of Engine Units. Every Engine Unit performs a single computation. Different Engine Units are assembled together (configured by Spark) to perform a complex computation and work as a pipeline. Many pipelines (say N) can be constructed to perform N parallel computations, so that N records can be digested in a single FPGA cycle.
      • Data Flow: Spark pulls data from the data source, converts it into the format the FPGA requires, and puts it into the Input Buffer array. The FPGA then reads the input data via DMA RX and feeds it into the Engine Pipelines. The results of the Engine Pipelines are written to the Output Buffer array via DMA TX. Finally, Spark converts the data back into the format Spark SQL needs.
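
As a software analogy for the engine pipeline described on this slide, the sketch below chains simple functions into one pipeline and applies it to N records per step. It is only a model of the idea: `EngineUnit`, `add`, `mul` and `runParallel` are hypothetical names, the units are arithmetic stand-ins for RTL blocks, and grouping records by N imitates N parallel pipelines without any real concurrency.

```scala
object EnginePipelineSketch {
  // An Engine Unit performs one basic computation on a value.
  type EngineUnit = Long => Long

  // Hypothetical arithmetic units; the real units are configurable RTL blocks.
  def add(k: Long): EngineUnit = _ + k
  def mul(k: Long): EngineUnit = _ * k

  // Chain units into one pipeline: the output of each unit feeds the next.
  def pipeline(units: Seq[EngineUnit]): EngineUnit =
    x => units.foldLeft(x)((acc, unit) => unit(acc))

  // N pipelines side by side: each "cycle" consumes up to n records.
  def runParallel(p: EngineUnit, n: Int, records: Seq[Long]): List[Long] =
    records.grouped(n).flatMap(cycle => cycle.map(p)).toList
}
```

For example, `pipeline(Seq(add(3), mul(2)))` maps 4 to 14, and running it with `n = 2` over five records takes three steps while preserving record order.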
  26. Arch Overview – SQL Engine Micro Architecture
      • Every SQL expression evaluation engine is configurable.
      • Every engine contains at most four pattern engines. The input data is fed to the pattern engines in parallel, and the final result is the combination of the pattern engine results.
        ‒ Pattern Engine 1 is configured to evaluate the SQL expression Substr(oper_tid, 1, 1) IN (‘1’, ‘7’)
        ‒ Pattern Engine 2 is configured to evaluate the SQL expression Substr(oper_tid, 2, 1) IN (‘o’)
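
The two pattern configurations above can be modelled in a few lines. This is a hedged sketch, not the RTL design: `Pattern` and `Engine` are hypothetical types, and because the slide does not say how the pattern results are combined, the combiner is left as a parameter rather than fixed to AND or OR.

```scala
object PatternEngineSketch {
  // One configured pattern: Substr(field, start, len) IN (values).
  final case class Pattern(start: Int, len: Int, values: Set[String]) {
    def eval(field: String): Boolean =
      values.contains(field.slice(start - 1, start - 1 + len)) // 1-based start
  }

  // An engine holds at most four pattern engines, all fed the same input.
  final case class Engine(patterns: Seq[Pattern],
                          combine: Seq[Boolean] => Boolean) {
    require(patterns.length <= 4, "at most four pattern engines per engine")
    def eval(field: String): Boolean = combine(patterns.map(_.eval(field)))
  }

  // The two configurations shown on the slide.
  val p1: Pattern = Pattern(1, 1, Set("1", "7")) // Substr(oper_tid, 1, 1) IN ('1', '7')
  val p2: Pattern = Pattern(2, 1, Set("o"))      // Substr(oper_tid, 2, 1) IN ('o')
}
```

With an AND combiner, `Engine(Seq(p1, p2), _.forall(identity)).eval("1o23")` is true, while the sample value `31011G13iG0` from the workload table fails both patterns.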
  28. Performance Comparison - FPGA vs Baseline
      • The FPGA-accelerated version significantly reduced the total execution time in the end-to-end benchmark, from 86 seconds (baseline) to 44 seconds.
      • Speedup ratio: 86s / 44s => ~2X
      *For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  29. Performance Comparison - FPGA vs Baseline, Contd.
      • The FPGA-accelerated version reduced the share of CPU time spent on expression evaluation (projection) in the map stage from 66.7% (baseline) to less than 6.6%.
      *For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
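
The arithmetic behind the "~2X" figure, plus a rough sanity check, can be written out as follows. The speedup uses only the two timings reported in the slides. The Amdahl-style bound is an added illustration, and it rests on an assumption the slides do not make: that the 66.7% CPU share of projection translates directly into the same share of wall-clock time.

```scala
object SpeedupSketch {
  val baselineSec = 86.0 // end-to-end baseline time reported in the slides
  val fpgaSec = 44.0     // end-to-end FPGA time reported in the slides

  // Observed end-to-end speedup, which the slides round to "~2X".
  val speedup: Double = baselineSec / fpgaSec

  // Amdahl-style upper bound if a fraction of the work took zero time.
  // Illustrative only: assumes CPU share equals wall-clock share.
  def amdahlBound(fraction: Double): Double = 1.0 / (1.0 - fraction)
}
```

Under that assumption, removing projection entirely would cap the gain at `amdahlBound(0.667)`, a little over 3X, so an observed speedup just under 2X is of a plausible magnitude.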
  31. Future Works
      • Fully Configurable FPGA SQL Acceleration Engine
        ‒ In this PoC, we identified the SQL expression patterns manually in the frontend and configured them into the FPGA SQL Engine units at runtime. However, the FPGA SQL engines are limited to some of the typical expression patterns, and arbitrary combinations of SQL expressions are not supported yet.
      • More Operators Support
        ‒ SQL expression evaluation in projection is the first step. Other typical operators such as Aggregation/Sort/Join can probably also be offloaded to the FPGA.
        ‒ The CPU can also perform expression evaluation when the FPGA resources are fully occupied with computation.
  32. Thank You
      qi.xie@intel.com, hao.cheng@intel.com, quanfu.wang@intel.com
