Sudhir Menon, Founder and COO of SnappyData, explains how to tackle data gravity on Kubernetes, along with strategies and best practices for running, scaling, and leveraging stateful containers in production.
5. A Spark Based Big Data Analytics Platform
[Diagram: The SnappyData platform. A Spark API layer (Streaming, ML, Graph; DataFrames, RDDs, DataSets) sits on an in-memory hybrid store that combines a row/columnar Spark cache, synopses (samples), transactions, indexing, and full SQL with HA. A unified catalog and unified data access layer (virtual tables) federate external sources: HDFS/HBase, S3, JSON/CSV/XML, SQL databases, Cassandra, MPP databases, and stream sources. Access is via Spark jobs, Scala/Java/Python/R APIs, JDBC/ODBC, and the object API (RDD, DataSets). The native store builds on GemFire.]
6. We transform Spark from this…
[Diagram: Spark today. Each user/app runs its own Spark master and execution workers as a framework for streaming, SQL, ML, etc., each with an immutable cache, pulling from a deep-scale, high-volume MPP DB plus HDFS, SQL, and NoSQL sources. Drawbacks: the cache cannot be updated, it is duplicated for every user/app, and the shared backing stores become a bottleneck.]
7. … into “an always-on hybrid database”!
[Diagram: The SnappyData model. Long-running Spark executor JVMs (workers) are collocated with a mutable, transactional in-memory row + column store with indexing, forming a single Spark cluster with shared-nothing persistence. Clients connect via JDBC/ODBC or submit Spark jobs through the Spark driver; history is pulled on demand from the deep-scale, high-volume MPP DB and from HDFS, SQL, and NoSQL sources.]
8. Architecture
[Diagram: Each SnappyData server is a Spark executor plus the store, coordinated by a cluster manager and scheduler with a distributed membership service for HA; servers can be added or removed dynamically. Queries pass through the parser and query optimizer into either the low-latency TXN engine, the higher-latency OLAP engine, or the Synopsis Data Engine. Stream processing operates over DataFrames/RDDs. The hybrid store holds probabilistic, row, and column tables with indexes; tables are exposed over ODBC/JDBC.]
9. Live Analytics Without the Need for Pipelines
• Apps/BI clients execute ad-hoc join/aggregation queries spanning multiple NoSQL stores
• No need for expensive pre-aggregations on large data sets
• Analytics on current, moving data
• Built-in Spark ETL to enrich data
• 20X faster than Spark; 100-1000X faster than Spark-Cassandra
[Diagram: Microservices (session state, profiles, orders) continuously replicate into in-memory row/column tables; virtual tables over SQL and NoSQL connectors pull history on demand; Spark transforms (data prep) and stream windows continuously summarize; rich Spark APIs join with Hadoop and NoSQL.]
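As a minimal sketch of the row/column table model above, both table kinds can be created from the `snappy` shell; the schema, table names, and option values here are hypothetical, and assume SnappyData's SQL DDL with a `USING row`/`USING column` clause:

```shell
<snappydata-product-dir>/bin/snappy
snappy> connect client '<locator-ip>:1527';
snappy> -- row table: mutable and indexable, suited to point lookups/updates
snappy> create table profiles (id int primary key, name varchar(64))
        using row options (partition_by 'id');
snappy> -- column table: compressed, suited to scans and aggregations
snappy> create table orders (id int, profile_id int, amount double)
        using column options (buckets '32');
snappy> select profile_id, sum(amount) from orders group by profile_id;
```

The split mirrors the slide's hybrid-store point: operational state lands in row tables, while analytics run over column tables in the same cluster.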
10. Use-case Patterns
• Real-time analytics on the operational DB
  • Move from traditional cubes to distributed in-memory for real-time
• Streaming with interactive analytics
  • Stream joins with history/context
  • Tableau/Spotfire/Zeppelin-based interactive analytics
• Interactive exploratory analytics
  • Patterns, Top-K, trends at Google-like speed
11. Snappy on PKS – Cloud-Neutral Containerized Analytics Platform
In-memory redundancy and HA provided by SnappyData
Pod redundancy and restarts provided by Kubernetes
VM redundancy and restarts provided by PKS
12. Steps To Launch A Snappy Cluster On PKS
# Connect to PKS cluster
• pks login -a https://api.pks.snappydata.io -u <uname> -p test123 -k
• pks get-credentials pks-cluster-01
• kubectl config use-context pks-cluster-01
# Update to the latest snappydata chart
• cd <spark-on-k8s-checkout>
• git fetch
• git checkout enable-hive-server
# Start SnappyData cluster and note the external IP addresses of lead and locator
• helm install --name snappydata --namespace snappy ./charts/snappydata/
• kubectl get services -n snappy | grep public
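Before loading data, it helps to confirm the pods are up and capture the external IPs with `kubectl`. This is a sketch: the public service names depend on the chart version, so `snappydata-locator-public` below is an assumption:

```shell
# Wait for the SnappyData pods (locator, lead, servers) to become Ready
kubectl wait --for=condition=Ready pod --all -n snappy --timeout=300s

# Extract the locator's external IP for JDBC clients (service name assumed)
LOCATOR_IP=$(kubectl get svc snappydata-locator-public -n snappy \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "locator: ${LOCATOR_IP}:1527"
```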
13. Steps To Launch A Snappy Cluster On PKS
# Load data into the cluster
• <snappydata-product-dir>/bin/snappy
• snappy> connect client '<locator-public-ip>:1527';
• snappy> run '<path/to/attached/load_CFPB_CC_Data.sql';
# Access SnappyData dashboard at <lead-public-ip>:5050
# Tableau workbook
• Point the workbook to the lead node.
• Launch the workbook by double-clicking it.
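Once the data is loaded, a quick ad-hoc query from the same `snappy` shell confirms the tables are queryable. The table and column names below are hypothetical; the real names come from the attached `load_CFPB_CC_Data.sql` script:

```shell
snappy> -- hypothetical schema for illustration
snappy> select product, count(*) as complaints
        from cc_complaints
        group by product
        order by complaints desc
        limit 10;
```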
14. How We Beat The Competition
Unified analytics through deep integration into Apache Spark and its ecosystem
High performance through an in-memory design center
Support for ETL-free live data through CDC integration
Scale and performance using our Synopsis Data Engine
Cloud-neutral, lower-TCO analytics platform based on Kubernetes
Standards-based approach with support for SQL, ML, & streaming
16. 2019 Themes
Kubernetes for multi-cloud support
• Multi-cloud certification
DevOps simplification
• Cloud-neutral managed cloud offering
Enterprise readiness
• RLS, persistence to cloud, backup/restore using Parquet, dashboard enhancements, improved performance using SIMD
Ecosystem support
• Support for Debezium; certified on major Spark distributions
19. Smart City – Parking, congestion management
● Sensors power lamp posts
● Optimize parking services
● Optimize energy consumption
● Congestion control
Challenges:
• Hundreds of thousands of sensor streams generating too much data
• Actionable intelligence requires analyzing streams together with history
• Ad-hoc interactive analytics on all of this data
20. Smart City – Parking, Congestion Management
Application built using SnappyData's Unified Analytics API
Reduced complexity due to fewer moving parts
20X better performance with far fewer resources
21. Performance Benchmark
600% faster than Apache Spark on TPC-H (complex analytical queries)
Up to 20X faster than Spark on complex joins and aggregations
The Snappy store and Spark executor share the same process space and JVM memory
Reference-based access – zero copy