As Real Time Analytics built on the Spark Streaming framework gains increasing adoption in the data center, the need for hardware accelerators that scale performance and offer lower latency for certain applications is becoming increasingly apparent. Recently, reconfigurable accelerators based on FPGAs (Field Programmable Gate Arrays) have been proposed to accelerate analytics workloads with support for low-latency streaming. This is becoming a feasible and attractive option as major CSPs (Cloud Service Providers) announce deployments of FPGA-as-a-Service (FPGAaaS).
However, there are certain challenges in using FPGA accelerators in the cloud that must first be overcome: (1) developing FPGA accelerators is difficult because of the lack of mature high-level language support and standard accelerator interfaces; (2) using hardware accelerators from high-level application frameworks like Spark requires special consideration for the efficient transfer of data to and from the application; and (3) sharing FPGA accelerators across different applications is limited by the lack of a common resource-sharing paradigm.
In this session, we’ll present a runtime framework that addresses these challenges. The framework presents highly optimized FPGA-based accelerated functions as a service (AFaaS) to the application, demonstrates seamless integration of accelerators into existing Big Data frameworks through Java, Scala, Python, or other high-level language APIs, and supports both the transfer of streaming data between the FPGA and the application and the sharing of accelerator libraries across different applications. Key takeaways:
• This paper demonstrates the feasibility of accelerating Real Time Analytics applications using Spark Streaming with accelerated functions based on FPGAaaS to deliver higher performance efficiency at lower latency.
• The paper provides performance data for Spark Streaming applications using our solution compared to current native software implementations.
2. Agenda
• Using Spark Streaming for Real Time Analytics
• Why FPGA : Low Latency and High Throughput
– Inline Processing
– Offload Processing
• Challenges in Using FPGA accelerators
• Megh Platform
– Arka Runtime
– Sira AFUs
• Demo Applications
• Conclusion
2#HWCSAIS17
3. Using Spark Streaming with ML / DL
for Real Time Analytics
[Diagram: data sources (social media, operations, transportation, marketing, sensors, web) feed an ETL data processing and ML/DL streams pipeline; the application produces queries, alerts, and analysis]
4. Real Time vs. Batch Insights
[Chart: the value of data to decision making decays over time (secs, mins, hours, days, months), an "information half-life" in decision making; time-critical decisions (predictive/preventive, actionable) sit at the seconds-to-minutes end, while traditional "batch" business intelligence (reactive, historical) sits at hours to months]
5. Real Time Insights
[Chart: latency spectrum for real-time insights, from hard real time to regular: trading (< 1 us), fraud prevention (10s of us), edge computing (ms to 10s of ms), dashboard inference (100s of ms), and operational insights (seconds)]
6. Real Time Analytics platform:
using Heterogeneous CPU+FPGA computing
[Diagram: the same data sources (social media, operations, transportation, marketing, sensors, web) feed a CPU+FPGA data processing platform that serves applications in both batch mode and real time mode, deployable in public, private, or edge clouds; outputs are queries, alerts, and analysis]
7. In-Line Stream Processing:
using heterogeneous CPU+FPGA platform
[Diagram: a conventional worker node's executor runs Filter #1, Filter #2, and MLlib tasks behind the system NIC; with the FPGA NIC, Filter #1 and Filter #2 run on the FPGA, leaving only the MLlib tasks in the executor]
The FPGA terminates the network and dynamically chains filters to provide pre-processed, low-latency DStreams to Spark apps transparently.
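The filter-chaining idea can be sketched in plain Java; all names below are illustrative, not part of the Megh/Arka API. Each filter is a function over a record, and a chain composes them in order, mirroring how the FPGA applies Filter #1 then Filter #2 before data reaches Spark.

```java
import java.util.List;
import java.util.function.Function;

public class FilterChain {
    // Compose an ordered list of filters into one transform, in the
    // same order the FPGA would chain them in line.
    public static Function<String, String> chain(List<Function<String, String>> filters) {
        Function<String, String> combined = Function.identity();
        for (Function<String, String> f : filters) {
            combined = combined.andThen(f);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Filter #1: data cleaning (trim); Filter #2: layout transform (upper-case).
        Function<String, String> chained = chain(List.of(
                String::trim,
                String::toUpperCase));
        System.out.println(chained.apply("  sensor reading  "));  // prints SENSOR READING
    }
}
```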
9. In-Line Stream Performance
[Charts: Lower Latency and Higher Throughput]
Source: "An FPGA-Based Low-Latency Network Processing for Spark Streaming," K. Nakamura et al., Proceedings of the 2016 IEEE International Conference on Big Data (Big Data 2016)
10. Off-load Processing (ML/DL):
using heterogeneous CPU+FPGA platform
[Diagram: each worker node's executor runs SQL and DLLib tasks; with off-load, the DLLib task's computation moves to the FPGA]
ML/DL algorithms are accelerated transparently by providing Spark bindings to FPGA implementations of ML/DL libraries.
11. Off-load ML/DL Processing:
FPGA Architecture
Source: "Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?," The Next Platform, March 21, 2017
15. Megh Platform:
abstracts the complexity of the FPGA
[Diagram: Sira Accelerator Function Units (AFUs) on the FPGA provide packet RX/TX, streaming functions, and ML/DL functions; on the CPU, the FPGA driver and Arka Runtime expose them through Java/C++ library adaptors to applications and other app frameworks, supporting both in-line and off-load processing]
Application:
• uses standard APIs
• and/or custom APIs
Arka Runtime:
• FPGA management
• SW fallback
• Exposes AFaaS
Sira Accelerators:
• Downloaded at runtime
• Bare metal or exposed to VMs via VMM
16. Virtualized Real Time Analytics Stack
[Diagram: on the CPU, VMs host JVMs whose threads run Spark driver/tasks with custom packages and ML packages/libs; ML adapters and Arka JNI access sit on the Megh Arka Java/Scala layer over the Arka Runtime, its low-level FPGA access library, utilities (resource manager, scheduler, etc.), and the FPGA kernel driver; the VMM exposes the FPGAs via VFIO (or a Windows equivalent) or PCIe passthrough; each FPGA runs a shell hosting multiple AFUs]
Application:
• uses standard APIs
• and/or custom APIs
Runtime:
• FPGA management
• SW fallback
• Exposes AFaaS
Accelerators:
• Downloaded at runtime
• Exposed to VMs via VMM
17. In-Line Processing:
Smart rx/tx adaptor architecture
[Diagram: in user space on the CPU, the Spark DStream adapter sits over the Arka Runtime; the FPGA kernel driver in kernel space connects via DMA (VirtIO) to the FPGA shell, which hosts the packet processor, filters, and streaming processor]
• Packet Processor: intercepts network packets destined for Spark
• Filters: perform data cleaning, resize, and layout transforms (ETL operations)
• Streaming Processor: creates DStream packets for Spark
18. public final class JavaSqlNetworkWordCount {
      private static final Pattern SPACE = Pattern.compile(" ");
      public static void main(String[] args) throws Exception {
        if (args.length < 2) {
          System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
          System.exit(1);
        }
        StreamingExamples.setStreamingLogLevels();
        // Create the context with a 1 second batch size
        SparkConf sparkConf = new SparkConf().setAppName("JavaSqlNetworkWordCount");
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
        // Create a JavaReceiverInputDStream on target ip:port and count the
        // words in the input stream of '\n'-delimited text (e.g. generated by 'nc')
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
            args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
        // Split2Words is supplied by the etlLib jar (CPU or FPGA implementation)
        JavaDStream<String> words = lines.flatMap(new Split2Words());
        ..
      }
      ..
    }
Inline sample implementation
CPU IMPLEMENTATION
1. Sets up the DStream CPU adapter connected to the system NIC.
2. Configures IP/port on the CPU NIC.
3. etlLibCPU.jar (CPU implementation)
• split2Words()
• split2Sort()
• split2Count()
FPGA IMPLEMENTATION
1. Sets up the DStream FPGA adapter connected to the FPGA NIC.
2. Configures IP/port on the FPGA NIC.
3. etlLibFPGA.jar (FPGA implementation)
• split2Words()
• split2Sort()
• split2Count()
The FPGA is set up to stream and filter data before passing it to Spark as a DStream object.
* Full implementation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaSqlNetworkWordCount.java
19. Off-load Processing:
Low latency off-load of ML/DL libraries
[Diagram: in user space on the CPU, the Spark DStream adapter sits over the Arka Runtime; the FPGA kernel driver connects via DMA (VirtIO) to the FPGA shell, which hosts the ML and DL libraries and an inter-FPGA network]
• Machine Learning Libraries: optimized libraries for K-Means, SVM, etc.
• Deep Learning Libraries: optimized libraries for DNN-based inference engines
• Inter-FPGA Network: FPGA network for sharing FPGA resources across larger DNN topologies
20. public class JavaKMeansExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("JavaKMeansExample");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        ..
        // Cluster the data into two classes using KMeans
        int numClusters = 2;
        int numIterations = 20;
        KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations);
        ..
        double cost = clusters.computeCost(parsedData.rdd());
        System.out.println("Cost: " + cost);
        // Evaluate clustering by computing Within Set Sum of Squared Errors
        double WSSSE = clusters.computeCost(parsedData.rdd());
        System.out.println("Within Set Sum of Squared Errors = " + WSSSE);
        ..
        jsc.stop();
      }
    }
Offload Sample Implementation
mlib.jar (CPU library implementation)
• KMeans.train()
• KMeansModel.computeCost()
mlibFPGA.jar (FPGA accelerated library implementation)
• KMeans.train()
• KMeansModel.computeCost()
CPU and FPGA libraries share the same function signatures, providing application-transparent acceleration when the FPGA library is used.
* Full implementation:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaKMeansExample.java
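The shared-signature pattern can be sketched in plain Java; the interface, class, and method names below are illustrative, not the actual mlib/mlibFPGA APIs. Both implementations satisfy one interface, a loader picks whichever is available, and application code never changes.

```java
public class SharedSignatureDemo {
    // One signature both implementations must satisfy (hypothetical).
    interface KMeansLib {
        double computeCost(double[] points, double[] centers);
    }

    // CPU fallback: sum of squared distances to the nearest center (1-D).
    static class CpuKMeansLib implements KMeansLib {
        public double computeCost(double[] points, double[] centers) {
            double cost = 0.0;
            for (double p : points) {
                double best = Double.MAX_VALUE;
                for (double c : centers) {
                    best = Math.min(best, (p - c) * (p - c));
                }
                cost += best;
            }
            return cost;
        }
    }

    // Loader returns the FPGA library when present, else the CPU one;
    // callers only ever see the KMeansLib interface.
    static KMeansLib load(boolean fpgaAvailable) {
        // An FpgaKMeansLib with the same signature would be returned here.
        return new CpuKMeansLib();
    }

    public static void main(String[] args) {
        KMeansLib lib = load(false);
        double cost = lib.computeCost(new double[]{0.0, 1.0, 9.0, 10.0},
                                      new double[]{0.5, 9.5});
        System.out.println("Cost: " + cost);  // each point is 0.25 from its center: 4 * 0.25 = 1.0
    }
}
```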
21. public static void main( String[] args ) throws Exception {
      System.out.println( "Java: NumAdd Spark Demo.\n" );
      Long total = null;
      SparkConf sparkConf = new SparkConf().setAppName( "NumAdd" );
      JavaSparkContext ctx = new JavaSparkContext( sparkConf );
      JavaRDD<String> lines = ctx.textFile( args[0], 1 );
      // sumOneString parses the numbers in one line and returns their sum
      JavaRDD<Long> sums = lines.map( new sumOneString() );
      total = sums.reduce( (a, b) -> (a + b) );
      System.out.println( "Total is -> " + total );
      ctx.stop();
    }
numAdd Demo:
Implementation details
numAdd is a slight variation of the popular WordCount sample in which the numbers in the input files are parsed and added up using Spark.
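A CPU implementation of the sumOneString function mapped over each line might look like the following sketch; only the class name comes from the slide, the body is illustrative (in the Spark demo it would implement org.apache.spark.api.java.function.Function<String, Long>).

```java
public class sumOneString {
    // Parses the whitespace-delimited numbers in one line of text
    // and returns their sum; an empty line sums to 0.
    public Long call(String line) {
        long sum = 0L;
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                sum += Long.parseLong(token);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(new sumOneString().call("1 2 3 4"));  // prints 10
    }
}
```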
22. Accelerated Operation:
sumOneString
AFU.Factory fpgaFactory = new AFU.Factory();
AFU wc = fpgaFactory.createAFU("meghna");
TransferBuffer inbuf = wc.getTransferBuffer(input1.length());
wc.queueInputBuffer(inbuf);
// Reuse the input buffer for the output; the AFU design ensures
// this is safe, and Arka permits it.
wc.queueOutputBuffer(inbuf);
wc.startFunction(); // The real work starts here
TransferBuffer obuff = wc.waitOnOutputQueue();
return obuff.getByteBuffer().asLongBuffer().get(0);
Instantiate the AFU as a service: this enables multiple distinct implementations to co-exist and be selected dynamically, specifically an FPGA implementation and a CPU-based fallback implementation.
Buffer-queue based model:
• (a register interface is available but not shown)
AFU-optimized transfer buffers allow for:
• zero copy to HW and efficient access from Java/Scala
• AFU-specific implementations
• use of direct byte buffers, SVM, Netty, Apache Arrow, etc.
startFunction() starts the operation; waitOnOutputQueue() waits for results in the output queue.
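The final line of the snippet, obuff.getByteBuffer().asLongBuffer().get(0), can be illustrated with a plain direct ByteBuffer; this is a sketch of the underlying mechanism, not the TransferBuffer API. A direct buffer lives outside the JVM heap, which is what makes zero-copy DMA from hardware possible, and a LongBuffer view reads the 64-bit result without copying.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TransferBufferSketch {
    public static void main(String[] args) {
        // A direct buffer is allocated outside the JVM heap, so hardware
        // can DMA into it without an extra copy.
        ByteBuffer buf = ByteBuffer.allocateDirect(8).order(ByteOrder.nativeOrder());

        // Stand-in for the AFU writing a 64-bit result into the buffer.
        buf.putLong(0, 42L);

        // Read it back the same way the AFU snippet does:
        // obuff.getByteBuffer().asLongBuffer().get(0)
        long result = buf.asLongBuffer().get(0);
        System.out.println("Result: " + result);  // prints Result: 42
    }
}
```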
23. Demo: NumAdd Offload Profiling
[Chart: NumAdd execution time (s) vs. file size (1M, 2M, 4M), comparing FPGA offload against Spark Streaming; y-axis from 0 to 200 s]
* Executor/task on the worker node restricted to 1 thread
24. In Summary….
• Megh CPU+FPGA platform optimized for Real Time Analytics
• Arka Runtime supports different streaming frameworks
• Sira AFUs deliver low latency and high throughput for inline and offload processing