12. BIG DATA PLATFORMS
[Diagram: data scientists & developers work in a data science programming model focused on analytics; performance engineers work in an acceleration programming model focused on microarchitecture (Many-Cores, FPGA, SmartSSD). The programming model gap and skills gap are the inhibitors for hardware acceleration.]
13.
Cross platform
Hybrid acceleration
Intelligent, automatic
computation slicing
Zero code change
Dataflow Adaptation Layer
Dataflow Compiler
Hypervisor
HYPER-ACCELERATION
2X to 10X acceleration
BIG DATA PLATFORMS
Bigstream Hyper-acceleration Layer
Many-Cores, FPGA, SmartSSD
15. Bigstream Seamless Acceleration of Apache Spark
[Architecture diagram: the client application drives Spark through the standard Big Data Platform APIs (zero code change). On the master node, Catalyst produces a physical plan and the Bigstream Dataflow Compiler asks "Accelerate?" per task: Y yields a hyper-accelerated task built from HW accelerator templates via the Dataflow Adaptation layer, N yields a normal Spark task. Cluster management (Resource Manager, Application Master, Node Manager) exchanges resource-management messages as usual. On each executor node, the Bigstream Hypervisor runs the executors' tasks (normal or hyper-accelerated) on Many-Cores, FPGA, or SmartSSD.]
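The per-task "Accelerate?" decision can be sketched as a routing check: a task takes the hyper-accelerated path only if every operator in it has a matching hardware accelerator template, otherwise it falls back to a normal Spark task. This is an illustrative sketch, not Bigstream's actual interface; the operator names and the `FPGA_TEMPLATES` set are assumptions.

```python
# Hypothetical sketch of the "Accelerate?" decision: a physical-plan task
# is routed to the FPGA path only when every operator in it has a matching
# hardware accelerator template. Names are illustrative, not Bigstream's API.

FPGA_TEMPLATES = {"scan", "filter", "project", "hash_aggregate"}

def route_task(task_operators):
    """Return 'hyper-accelerated' if all operators map to HW templates,
    otherwise fall back to a normal Spark task (zero code change)."""
    if all(op in FPGA_TEMPLATES for op in task_operators):
        return "hyper-accelerated"
    return "normal"

print(route_task(["scan", "filter"]))           # hyper-accelerated
print(route_task(["scan", "sort_merge_join"]))  # normal
```

Because the fallback path is always available, a query mixing supported and unsupported operators still runs end to end, which is what makes the approach transparent to the application.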
16. AWS F1 TPC-DS Speedup Results vs Spark
[Bar chart: time (s) per TPC-DS query number; lower is better]
CPU: AWS F1 instance with 8 vCPUs and one Xilinx VU9P FPGA
~4x speedup on average; ~5x on scan-heavy queries
17. TPC-DS Scan Heavy Query - Cluster Results
[Bar chart: time (s) per TPC-DS query number; lower is better]
Processor: Intel Xeon Gold 6152, 22 cores, with one SmartSSD per node; memory: 200 GB per node
Spark config: one master, 4 executor nodes, 6 Spark cores per node
~4x speedup on average; ~6x on scan-heavy queries
18. 4x Faster Spark Queries on Microsoft Azure
*SmartSSD end-to-end speedup (vs. standard SSD) on a 20 GB demo data set
Find "annoying" flights: more than a 10-minute delay from scheduled departure
Query 1: create a heatmap of the number of annoying flights on a US map over the last 5 years
Query 2: create a heatmap of the number of annoying flights in the Bay Area since 2000
Data: Flights, Planes, Airports, Airlines
Results: 51 sec / 13 sec = 3.9x faster; 49 sec / 11 sec = 4.4x faster
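The "annoying flight" predicate behind both queries is simple to state. Below is a minimal stand-alone sketch in plain Python; the field names are assumptions, not the demo's schema, and in Spark the same predicate would be a DataFrame `filter` over the flights table.

```python
# A minimal sketch of the "annoying flights" predicate from the demo:
# a flight is annoying when its actual departure is more than 10 minutes
# after the scheduled one. Field names (minutes since midnight) are assumed.

flights = [
    {"flight": "UA101", "sched_dep": 600, "actual_dep": 615, "year": 2018},
    {"flight": "AA202", "sched_dep": 720, "actual_dep": 725, "year": 2019},
    {"flight": "DL303", "sched_dep": 480, "actual_dep": 505, "year": 2015},
]

def is_annoying(f):
    return f["actual_dep"] - f["sched_dep"] > 10

annoying = [f["flight"] for f in flights if is_annoying(f)]
print(annoying)  # ['UA101', 'DL303']
```

The heatmap queries then just group such flights by location and count them; the speedup comes from pushing the scan and filter down to the SmartSSD, not from changing the query.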
20. UNCLASSIFIED
High Performance Analytics at Scale -- Before ETL and Indexing
Neil Tender, BlackLynx, Senior Research Engineer
www.BlackLynx.tech
October 2, 2019
21. UNCLASSIFIED
The Big Data Problem
Big Data:
We’re generating data faster than ever
Over 90% of all the world’s data was generated in the last two years
Over 175 ZB of data per year by 2025
Volume, Variety and Velocity
Traditional approaches require the use of data preprocessing, such
as Extract, Transform, Load (ETL) for Data Warehousing
The growth rate of actionable data is exponentially outpacing the
growth of analyzed data
Most data is generated at the Edge -- impractical to rely completely
on data center-based approaches
Computational Challenges
Cluster Computing (Apache Spark, Hadoop) does not scale and is
not practical for many use cases
Mobile environments with Size, Weight, and Power requirements
Source: Design World Online
Analytics challenges are forcing new thinking in network, storage, and computing.
22. UNCLASSIFIED
BlackLynx Value Proposition
BlackLynx Enables High Performance Analytics -- without first requiring ETL and Indexing
• High volume/velocity source data is "thinned" to a manageable size of useful data in real time using FPGA/CPU
heterogeneous high-performance compute
Results can then be fed into traditional data pipeline with ETL/Indexing
Preserves ability to store raw data and perform post-analysis on complete data source
• Supports wide variety of data formats:
unstructured and structured text, PCAP, geospatial, wide-area video and imagery
• Powerful BlackLynx APIs allow chaining of analytics primitives to perform complex searches and analytics
• BlackLynx technologies work together with your preferred visualization tools and applications to supercharge
the speed and capabilities of analytics
23. UNCLASSIFIED
BlackLynx Solutions
SearchLynx - text search, pattern matching, and analytics
Complex queries, including fuzzy, regex, and geolocation searches
Semi-structured (XML, JSON, CSV) and unstructured data
CyberLynx - PCAP/network forensic analysis on raw files
Layer 2-4 tags, coupled with SearchLynx to search payloads
VisionLynx - object detection/recognition
Wide Area Imagery still/video
Uses accelerated DNN inferencing techniques
SignalLynx – accelerated processing of signals
Integrated with GNU Radio
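To make "fuzzy search" concrete: it means matching a term within a small edit distance of the query. The sketch below is plain Python, not the SearchLynx API; BlackLynx's point is that primitives like this run on FPGAs directly over raw, unindexed data.

```python
# Illustrative sketch of what a "fuzzy" text search does: match a term
# within a small edit distance. This is not the SearchLynx API.

def edit_distance(a, b):
    # Classic one-row dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def fuzzy_find(term, words, max_dist=1):
    return [w for w in words if edit_distance(term, w) <= max_dist]

print(fuzzy_find("kernel", ["kernal", "kernel", "colonel"]))
# ['kernal', 'kernel']
```

Doing this per query, at line rate, is what removes the need to build an index in advance.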
25. UNCLASSIFIED
Example: BlackLynx Solution as a Splunk Enterprise App
• Extend Splunk Enterprise via “Apps” to
integrate BlackLynx software technology
and search all the raw data for cyber,
performance, and compliance purposes
• In parallel with Splunk ingest, direct all data
(PCAP for example) to BlackLynx servers
and provide high performance forensics
while reducing Splunk storage costs
• Integrate BlackLynx raw-data, 7-layer visibility
with Splunk’s 24-hour real-time monitoring to
identify and resolve issues faster
• Create opportunities for future machine
learning by fully analyzing the machine
generated data
[Diagram: 10-100 Gbps network data feeds a packet capture server; saved PCAP/JSON/CSV/XML/unstructured files land in a raw storage repository attached to the BlackLynx server. The BlackLynx Splunk App provides alerts & full analytics alongside Splunk's ingestion of PCAP, netflow, active triggers, Bro logs, and machine data; 3rd-party applications connect via RESTful or ODBC/JDBC interfaces; future machine learning by fully analyzing the machine-generated data.]
Ability to search ALL the data enables improved visibility to
answer the hard questions while not raising Splunk license
costs
More Efficient Triage while reducing TCO
Enable automation methods to accelerate event detection
through the elimination of ETL and indexing
Discover events faster
Leverage all the Splunk capabilities while adding BlackLynx performance and high end search
capabilities (fuzzy searching, regular expressions, raw PCAP, etc.) to handle the growth in machine data
26. UNCLASSIFIED
Splunk Powered by BlackLynx Performance Examples
• The DNS log (2 GB) and the PCAP files (15.6 GB) are from the U.S. National CyberWatch Mid-Atlantic Collegiate Cyber Defense Competition (MACCDC) dataset
• The tre-agrep tool was co-authored by Udi Manber, one of the great names in contemporary Computer Science and author of the well-regarded textbook Introduction to Algorithms: A Creative
Approach, which to this day enjoys wide use in Computer Science curricula worldwide
• The TSHARK search applies the filter parameter (ip.dest) to 16 files (serially). The TSHARK decode is only the time to build the decoded files (parallel processes) and does not include any filter time
27. UNCLASSIFIED
Wide Range of Hardware Platforms
Cloud
• Ultimate in
scalability
Edge
• Small form factor (SWaP) for
mobile, space, aeronautical
• Ruggedized/portable
environments
On-Premises
• High performance, dual-socket
servers
• Flexible compute/storage
configurations
31. XELERA ACCELERATION SOFTWARE
04.10.2019 31
• Analytics microservices
• Deterministic latency
32. XELERA ACCELERATION SOFTWARE
[Latency spectrum: hard real-time (milliseconds), actionable (seconds), reactive (minutes to hours), historical (days); real-time vs. batch processing]
34. SAP - BI ANALYTICS - FRAUD DETECTION
[Diagram: a web page sends a transaction request to the fraud-detection microservice, which checks it against collected customer behavior; an outlier is flagged as fraud, otherwise no fraud]
• More detections
• Fewer servers
• Lower operational costs
35. SAP - BI ANALYTICS - FRAUD DETECTION
Credit card transaction fraud detection:
• 145,751 data points
• 74 features per point
• Clustered into 2,000 partitions
[Bar chart: processing time [s], Xelera Analytics on OTC fp1c.2xlarge vs. SAP PAL on OTC s1.2xlarge; axis 0–300 s]
(*) Benchmarks obtained with SAP HANA PAL
on OTC; other recommender engine software
may deviate from these results
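The outlier test behind this kind of fraud detector can be sketched simply: cluster historical customer behavior, then flag a transaction whose distance to its nearest cluster center exceeds a threshold. The centroids, features, and threshold below are made up for illustration; they are not SAP PAL's or Xelera's actual parameters.

```python
# Sketch of clustering-based outlier detection for fraud scoring.
# Centroids stand in for clustered historical customer behavior;
# all numbers are illustrative.

import math

centroids = [(10.0, 1.0), (200.0, 3.0)]   # (amount, time-bucket) cluster centers
THRESHOLD = 50.0

def nearest_distance(point):
    return min(math.dist(point, c) for c in centroids)

def is_fraud(transaction):
    # Outlier = far from every known behavior cluster.
    return nearest_distance(transaction) > THRESHOLD

print(is_fraud((12.0, 2.0)))   # False: close to a behavior cluster
print(is_fraud((950.0, 4.0)))  # True: far from all clusters, an outlier
```

The FPGA's role in the benchmarked system is to keep this per-transaction scoring inside a tight latency budget at high request rates; the scoring logic itself stays the same.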
36. SPARK - BI ANALYTICS - RECOMMENDATION ENGINE
[Diagram: a web page calls a web service, which asks the recommendation microservice for a prediction]
• More recommendations per second
• Fewer servers
• Lower operational costs
37. SPARK - BI ANALYTICS - RECOMMENDATION ENGINE
Real-Time Movie Recommendation:
• 1,000 user requests per second
• 1,682 movies
(Machine Learning models)
• 50 ms round-trip latency constraint
[Bar chart: number of cloud instances needed, Xelera Analytics on AWS f1.2xlarge vs. Spark MLlib on AWS c4.8xlarge; axis 0–40]
(*) Benchmarks obtained with Apache Spark
framework on AWS; other recommender engine
software may deviate from these results
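At serving time, a recommendation request under a 50 ms budget reduces to dot products against precomputed latent factors (trained offline, e.g. with Spark MLlib ALS) plus a top-k sort. The factor values below are invented for illustration; only the shape of the computation is the point.

```python
# Minimal serving-side sketch of a recommender: score = user · movie
# over precomputed latent factors, then take the top-k. Factor values
# are made up; real factors would come from offline ALS training.

user_factors = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}
movie_factors = {"Heat": [1.0, 0.0], "Amelie": [0.1, 0.9], "Alien": [0.6, 0.4]}

def recommend(user, k=2):
    scores = {
        m: sum(u * f for u, f in zip(user_factors[user], mf))
        for m, mf in movie_factors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # ['Heat', 'Alien']
```

With 1,682 movies this inner loop is tiny per request; the engineering problem is sustaining 1,000 such requests per second within the round-trip constraint, which is where the accelerator earns its keep.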
38. AUDIO STREAMING ANALYTICS - SPEAKER RECOGNITION
Neural network
Audio signal representation & preprocessing
microservice
Audio stream
Speaker
• Multiple user sessions connect asynchronously
to the microservice
• Scalable on-demand
• Each request must be completed within a
60ms latency window
39. AUDIO STREAMING ANALYTICS - SPEAKER RECOGNITION
[Bar chart: concurrent sessions per accelerator, Alveo U250 vs. Tesla V100 (no batching, multi-DNN-model); axis 0–90]
[Bar chart: single-request latency [ms] (mean) and single-session latency (max), Alveo U250 vs. Tesla V100; axis 0–50 ms]
(*) Benchmark obtained with Alveo U250 Dell R740 server vs. NVIDIA Tesla V100 architecture on AWS EC2
p3.2xlarge instance. Other recommender engine software may deviate from these results
40. CALL TO ACTION
Join Xelera Analytics microservices
Alveo U200 Alveo U250 Alveo U280
72. Greenplum: Open-Source DW Solution
• Field tested with widespread adoption in Telco,
Financial, Government, Retail, Insurance, …
• ~5% market-share currently, growing slowly
• ~150mm per year
https://discovery.hgdata.com/product/greenplum-database
73. Deepgreen DB: a (much) better Greenplum
More Speed
Between 2 – 15X faster for
complex OLAP queries
while maintaining 100%
compatibility.
More Connected
Dynamically read/write
AWS S3, HDFS, Oracle,
Kafka, etc.
More Intelligent
Integrated in-database
machine learning,
geospatial function, video
decoding and object
classification.
74. Gain Speed by Removing Bottleneck
Abundant Storage
• TB RAM
• NVMe SSD
• Smart SSD
Abundant Network
Bandwidth
• 10, 100 GigE is common
CPU severely limited
• Same old Xeon
• Xeon-Phi is a no-show
The New Bottleneck
75. Core Technology
Accelerate SQL by fully exploiting x86, FPGA & SSD
• JIT code generation on SQL
• Use the FPGA to relieve the CPU
• SIMD column-store + zonemap
• Performant network interconnect
• Integrated in-database machine learning, geospatial,
and video decoding with Xilinx FPGA
Keep pushing compute to data
[Bar chart: speedup (0–10x) across available hardware configurations 1–4]
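The zonemap named in the bullet list above is worth making concrete: store a (min, max) pair per block of a column so that a filter can skip entire blocks without reading them. Deepgreen pairs this with SIMD execution; the stdlib sketch below shows only the pruning idea, with invented data.

```python
# Sketch of zonemap pruning: keep min/max per block of a column so a
# filter can skip whole blocks without reading them. Data is illustrative.

BLOCK = 4
data = [3, 7, 5, 6,   40, 42, 41, 44,   9, 8, 10, 12]

# Build the zonemap: one (min, max) pair per block.
zones = [(min(data[i:i + BLOCK]), max(data[i:i + BLOCK]))
         for i in range(0, len(data), BLOCK)]

def scan_gt(threshold):
    """Return values > threshold, skipping blocks whose max rules them out."""
    hits, blocks_read = [], 0
    for b, (lo, hi) in enumerate(zones):
        if hi <= threshold:          # zonemap says no row here can qualify
            continue
        blocks_read += 1
        hits += [v for v in data[b * BLOCK:(b + 1) * BLOCK] if v > threshold]
    return hits, blocks_read

print(scan_gt(30))  # ([40, 42, 41, 44], 1)  -- two of three blocks skipped
```

Because the pruning happens before the data is touched, it composes naturally with "pushing compute to data": a SmartSSD or FPGA path benefits from reading fewer blocks just as the CPU path does.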
76. Customer Use Case
• TELCO – churn analysis, BI, end-user usage application, etc.
• IOT – analysis on self-service data lake, SIM-card life-management
• Smart Cities – video discovery / log discovery
• Internet Company – anti-fraud, customer tagging, BI and reporting
77. FPGA Support
• Alveo U200, U250 – video discovery and log discovery applications
• Alveo U50 – Accelerated Postgres for Analytics
• AWS F1
• Azure
• Samsung Smart-SSD (coming soon)
79. Vision:
To enable customers to achieve significant performance improvement and
cost-savings beyond what traditional methods of computing can provide
About BigZetta
Location:
R&D center in Noida (India)
Business presence in San Jose and Seattle (USA)
Expertise:
Big-Data technologies
Power/Performance/Area Optimized Hardware design
Performance optimization of software applications using Hardware-Software co-design
81. Why accelerate Hive?
Most widely used Query Processing Engine in Big Data eco-system
More than 10,000 companies use Hive for their Big Data processing needs
Caters to variety of requirements: data warehouse, ETL, analytics etc.
Hive’s use for BI queries has critical runtime requirements (sub-second)
Provided by all major Big Data vendors
82. Data Analytics With Hive
Faster
Scale Up
Scale Out
CPU clock-speeds have saturated
Scale Up/Out give diminishing returns
How to get more speed?
83. FPGA Driven Acceleration
FPGAs work as a co-processor to the CPU
Speed up compute-intensive tasks
Available on all major clouds (AWS, Azure, Alibaba, Nimbix, …)
How to get the benefits of FPGAs in Hive?
84. CPU - Middleware - FPGA
bzQAccel: middleware between Hive and the underlying hardware
• Optimizes the query execution plan to suit FPGAs
• Provides the fastest execution of the plan on FPGAs
• For different queries, no need to recompile either the host code or the FPGA kernel
• Minimal data-movement penalty
[Diagram: the query and table data go from Hive to bzQAccel; bzQAccel transfers table data to the FPGA kernel (loaded with SQL operations), calls the kernel computation, and passes the result back to Hive]
86. Solution: bzQAccel (BigZetta Query Accelerator)
bzQAccel
No software or query
changes required
4x speed-up of
analytical queries
1-click install over any
Hive distribution
Technology extensible to Spark,
Presto, Impala, Druid ….
87. Availability
Supported Xilinx platforms: Alveo U200, U250 and U280
Whitepaper, datasheet and demo available at
http://www.bigzetta.com/
Trial software available on Nimbix cloud
To request an evaluation: sales@bigzetta.com
Fill in an evaluation checklist to help with qualification
91. www.inaccel.com™
Integrated solution for Application Acceleration
InAccel Scalable FPGA Resource Manager
Accelerated ML suite
On-premise Cloud
Higher Performance
Up to 16x Speedup compared to
highly optimized libraries
Lower Cost
Up to 4x lower TCO
Zero-code changes
Seamless integration to widely
used frameworks
Easy deployment
Docker-based container for
seamless integration
On-prem or on cloud
Available on cloud and on-prem
92. www.inaccel.com™
InAccel Technology: Coral FPGA Resource Manager
˃ Coral abstracts FPGA resources (device, memory), enabling fault-tolerant heterogeneous distributed systems to be built easily and run effectively.
World’s first FPGA orchestrator:
program against your FPGAs as if they were
a single pool of accelerators
[Diagram: applications call the InAccel Coral Resource Manager; the InAccel Runtime provides resource isolation over the FPGA drivers; each server hosts FPGAs with loaded kernels]
“automated deployment, scaling, and management of FPGAs”
93. www.inaccel.com™
InAccel Docker Service
˃ Sustains FPGA driver
compatibility between the host
and the containers
• discovers available resources
• mounts/isolates visible devices
‒ no need for --privileged
• resolves library dependencies
[Diagram: apps run in containers on the Docker engine with the InAccel container runtime; InAccel’s Coral device plugin exposes the server’s FPGAs (Intel/Xilinx) through the FPGA runtime on the host OS]
95. www.inaccel.com™
Performance evaluation on Machine Learning
˃ Up to 15x speedup for LR ML (7.5x overall)
˃ Up to 14x speedup for K-means ML (6.2x overall)
˃ Spark-GPU* (3.8x – 5.7x)
˃ f1.4x: 16 cores + 2 FPGAs (InAccel)
˃ r5d.4x: 16 cores
[Bar chart: logistic regression execution time, MNIST 24 GB, 100 iterations (secs); r5d.4x vs. f1.4x (InAccel); stacked: data preprocessing, data transformation, ML training; 15x speedup]
[Bar chart: K-means clustering execution time, MNIST 24 GB, 100 iterations (secs); r5d.4x vs. f1.4x (InAccel); stacked: data preprocessing, data transformation, ML training; 14x speedup]
*[Spark-GPU: An Accelerated In-Memory Data Processing Engine on
Clusters]
96. www.inaccel.com™
Serverless deployment
˃ Integrated framework for serverless deployment
˃ Compatible with Kubernetes
˃ Compatible with Kubeless, Knative
˃ Users only have to upload the images to the S3 bucket; InAccel’s FPGA Manager then automatically deploys the cluster of FPGAs, processes the data, and stores the results back in the S3 bucket.
˃ Users do not need to know anything about the FPGA execution.
[Diagram: files uploaded to Amazon S3 trigger the InAccel FPGA Resource Manager, which runs accelerated functions from the f1 library on a cluster of Amazon EC2 f1 instances and writes the results back to S3 for download]
https://medium.com/@inaccel/fpgas-goes-serverless-on-kubernetes-55c1d39c5e30
98. www.inaccel.com™
Example of scaling to 2 FPGAs using the resource manager for logistic regression
1.86x speedup using 2 FPGAs, simply by changing a line: you specify how many FPGAs you want to use.
inaccel start --fpga=xilinx:0,xilinx:1
or
inaccel start --fpga=all
99. www.inaccel.com™
Apache Arrow Integration: Summing Up
˃ Seamless Arrow integration
˃ Page-aligned
columnar format
˃ Native memory map
˃ Zero-copy operations
[Diagram: App1, App2, and App3 share the Coral FPGA Resource Manager in front of an FPGA cluster; the columnar-format structure resides in DRAM]
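The zero-copy property claimed above is the heart of the Arrow integration: a page-aligned columnar buffer in DRAM can be sliced and shared without duplicating data. Arrow's own API (pyarrow) provides this natively; the stdlib sketch below demonstrates just the idea with a `memoryview` over a contiguous column.

```python
# The zero-copy idea behind Arrow's columnar format, sketched with the
# Python stdlib: a memoryview slices a column's contiguous buffer without
# copying, which is what lets multiple apps (or an accelerator's DMA
# engine) share one in-DRAM columnar structure. pyarrow does this natively.

from array import array

column = array("d", [1.0, 2.0, 3.0, 4.0])  # one column, contiguous in memory
view = memoryview(column)                   # zero-copy view of the buffer

window = view[1:3]                          # slicing the view: still no copy
print(window.tolist())                      # [2.0, 3.0]

# Writing through the view mutates the original buffer, proving the
# memory is shared rather than copied:
view[0] = 9.0
print(column[0])                            # 9.0
```

For an FPGA resource manager, this matters because a kernel can consume the column straight from shared memory; serialization between the application's frame and the accelerator's input disappears.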
100. www.inaccel.com™
Try it now for free
˃ Get a free license for the Coral Resource Manager:
https://inaccel.com/license/
˃ Scale Xilinx’s cores (compression or OpenCV):
‒ https://docs.inaccel.com/latest/develop/examples/
˃ Use the open-source ML cores:
https://github.com/inaccel